1  Objectives

Supervised machine learning can be formulated in a mathematically rigorous way as the
problem of inferring an underlying functional relation behind data. Unsupervised machine
learning, in contrast, is often defined ambiguously, as extracting some useful knowledge
hidden in the data without explicit guidance. Unsupervised learning methods such as
clustering and change detection are nevertheless indispensable to various real-world
applications. Because of this vague formulation, studies of unsupervised learning tend
to be ad hoc, and the development of unsupervised learning methods is still far behind
that of supervised learning. The aim of this project is to overcome this difficulty by
providing a systematic approach to a class of ill-defined unsupervised learning problems
based on information measures.

Mutual information (MI) is a standard information measure that has been extensively
explored in the field of information theory. However, MI is hard to approximate from
data samples and it is not robust against outliers. In our project, we consider an
alternative information measure called squared-loss mutual information (SMI), which is
more robust against outliers by definition. To develop a family of machine learning
algorithms based on SMI, we utilize its robust and computationally efficient approximator
called least-squares mutual information (LSMI), which is one of the major deliverables
of my previous project, “A Density-Ratio Approach to Machine Learning”, supported by
AFOSR/AOARD (AOARD-09-4071). The usefulness of the proposed approach is demonstrated
through experiments.
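For reference, the two measures can be written side by side. The report itself does not spell out these formulas; the expressions below are the standard definitions as they are commonly stated in the SMI literature, and the symbols $p(x, y)$, $p(x)$, and $p(y)$ (the joint and marginal densities of the two variables) are notation assumed here.

$$\mathrm{MI}(X, Y) := \iint p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \, dx \, dy,$$

$$\mathrm{SMI}(X, Y) := \frac{1}{2} \iint p(x)\, p(y) \left( \frac{p(x, y)}{p(x)\, p(y)} - 1 \right)^2 dx \, dy.$$

Both quantities are non-negative and vanish if and only if $p(x, y) = p(x)\, p(y)$, i.e., when the two variables are statistically independent. MI involves the logarithm of the density ratio, whereas SMI involves only its square, which is the sense in which SMI is a "squared-loss" counterpart of MI.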

2  Status  of  effort 

My  project  consists  of  two  subjects:  (A)  Development  of  information-based  machine 
learning  algorithms  and  (B)  Improvement  of  information  estimators  for  further  advances. 

For the subject (A), we have actively explored various unsupervised machine learning
topics and developed novel information-based algorithms, including clustering,
independence testing, object matching, class-imbalance adaptation, change detection, and
canonical dependency analysis. We further developed methods of supervised dimension
reduction, probabilistic classification, and non-stationarity adaptation in the same
framework.

For the subject (B), we explored novel paradigms for further improving the accuracy
and robustness of information estimators, including information estimation with
dimension reduction for coping with high dimensionality, information estimation with
a relative-divergence and a difference-divergence for enhancing robustness against
outliers, and a unified framework of information estimation for better understanding
the mutual relation among different information measures.

3  Abstract

We  developed  various  information-based  machine  learning  algorithms: 

•  Object matching: Given two sets of unpaired objects (such as speech signals from
two different subjects, a set of photos and a photo frame, and images taken from
different modalities), we pair them by maximizing their mutual dependency
(Publication 4).

•  Clustering:  Given  input-only  samples,  we  determine  their  cluster  labels  by  finding 
the  most  dependent  label  assignments  on  the  original  input  samples  (Publications 
5  and  14). 

•  Canonical dependency analysis: Given two sets of paired samples, we find the
projections that maximize their dependency (Publication 20).

•  Statistical testing: Given two sets of samples, we decide whether they are drawn
from the same probability distribution (Publications 13 and 17). Similarly, given
paired samples, we decide whether they are independent (Publication 11).

•  Class-prior  change  adaptation:  Given  labeled  training  data  and  unlabeled  test  data 
having  different  class  balances,  we  estimate  the  class-balance  of  unlabeled  test  data 
by  matching  the  distribution  of  unlabeled  test  data  with  the  class-wise  mixture  of 
training  data  (Publication  9). 

•  Change-detection  in  time-series:  Given  time-series,  we  detect  change  points  at  which 
properties  of  time-series  switch  by  comparing  the  probability  distributions  of  current 
and  past  data  (Publication  19). 

•  Non-stationarity adaptation: Given labeled training data and unlabeled test data
having different input distributions, we perform distribution-adaptive learning for
reinforcement learning (Publication 15) and probabilistic classification
(Publication 16).

•  Supervised dimension reduction: Given input-output paired data, we reduce the
dimensionality of input data by maximizing the dependency (Publications 7 and 24).

•  Computationally efficient probabilistic classification: Given labeled data, we
estimate the posterior probability of labels given an input pattern in a
computationally efficient way (Publication 12).

We also investigated various properties of information estimators for further
development:


•  Elucidation of statistical and numerical properties of a least-squares kernel-based
information estimator (Publications 18 and 25).

•  Information estimation with dimension reduction for coping with high
dimensionality (Publication 6).

•  Information estimation with a relative-divergence and a difference-divergence for
enhancing robustness against outliers (Publications 8, 10, and 26).

•  A  unified  framework  of  information  estimation  for  better  understanding  mutual 
relation  among  different  information  measures  (Publication  21). 

•  Relation  to  a  kernel-based  independence  measure  (Publication  22). 

Finally,  we  published  monographs  and  review  articles  related  to  the  current  project: 

•  Monograph  on  density-ratio  estimation  (Publication  1). 

•  Monograph  on  non-stationarity  adaptation  (Publication  2). 

•  Review  article  on  non-stationarity  adaptation  (Publication  3). 

•  Review  article  on  information-based  learning  (Publication  23). 


4  Personnel  Supported 

The  research  activity  of  the  following  people  was  supported. 

•  Masashi  Sugiyama  (Tokyo  Institute  of  Technology), 

•  Makoto  Yamada  (Tokyo  Institute  of  Technology), 

•  Gang  Niu  (Tokyo  Institute  of  Technology), 

•  Marthinus  Christoffel  du  Plessis  (Tokyo  Institute  of  Technology). 

5  Publications 

During the 24 months, the following papers were published. The papers indicated by
“*” are attached to this report, and all the publications are available from

“http://sugiyama-www.cs.titech.ac.jp/~sugi/publications.html”.

Books  and  Articles 

1.  Sugiyama,  M.,  Suzuki,  T.,  &  Kanamori,  T.  Density  Ratio  Estimation  in  Machine 
Learning,  344  pages,  Cambridge  University  Press,  Cambridge,  UK,  2012. 

2.  Sugiyama,  M.  &  Kawanabe,  M.  Machine  Learning  in  Non-Stationary  Environments: 
Introduction  to  Covariate  Shift  Adaptation,  308  pages,  MIT  Press,  Cambridge,  MA, 
USA,  2012. 

3.  *  Sugiyama, M. Learning under non-stationarity: covariate shift adaptation by
importance weighting. In Handbook of Computational Statistics: Concepts and Methods,
2nd edition, Chapter 31, pp. 927-952, Springer, Berlin, Germany, 2012.

Conference  Papers

4.  *  Yamada,  M.  &  Sugiyama,  M.  Cross-domain  object  matching  with  model  selection. 
In  Proceedings  of  Fourteenth  International  Conference  on  Artificial  Intelligence  and 
Statistics  (AISTATS2011),  vol.15,  pp. 807-815,  Fort  Lauderdale,  Florida,  USA,  Apr. 
11-13,  2011. 

5.  *  Sugiyama, M., Yamada, M., Kimura, M., & Hachiya, H. On information-maximization
clustering: Tuning parameter selection and analytic solution. In
Proceedings of 28th International Conference on Machine Learning (ICML2011),
pp. 65-72, Bellevue, Washington, USA, Jun. 28-Jul. 2, 2011.

6.  Yamada, M. & Sugiyama, M. Direct density-ratio estimation with dimensionality
reduction via hetero-distributional subspace analysis. In Proceedings of the
Twenty-Fifth AAAI Conference on Artificial Intelligence (AAAI2011), pp. 549-554,
San Francisco, California, USA, Aug. 7-11, 2011.

7.  Yamada, M., Niu, G., Takagi, J., & Sugiyama, M. Computationally efficient
sufficient dimension reduction via squared-loss mutual information. In Proceedings of
the Third Asian Conference on Machine Learning (ACML2011), vol.20, pp. 247-262,
Taoyuan, Taiwan, Nov. 13-15, 2011.

8.  Yamada,  M.,  Suzuki,  T.,  Kanamori,  T.,  Hachiya,  H.,  &  Sugiyama,  M.  Relative 
density-ratio  estimation  for  robust  distribution  comparison.  In  Advances  in  Neural 
Information  Processing  Systems  24  (NIPS2011),  pp. 594-602,  2011. 

9.  *  du  Plessis,  M.  C.  &  Sugiyama,  M.  Semi-supervised  learning  of  class  balance  under 
class-prior  change  by  distribution  matching.  In  Proceedings  of  29th  International 
Conference  on  Machine  Learning  (ICML2012),  pp. 823-830,  Edinburgh,  Scotland, 
Jun.  26-Jul.  1,  2012. 

10.  *  Sugiyama,  M.,  Suzuki,  T.,  Kanamori,  T.,  du  Plessis,  M.  C.,  Liu,  S.,  &  Takeuchi, 
I.  Density-difference  estimation.  In  Advances  in  Neural  Information  Processing 
Systems  25  (NIPS2012),  pp.692-700,  2012. 

Journal  Papers 

11.  Sugiyama,  M.  &  Suzuki,  T.  Least-squares  independence  test.  IEICE  Transactions 
on  Information  and  Systems,  vol.E94-D,  no. 6,  pp. 1333-1336,  2011. 

12.  Yamada,  M.,  Sugiyama,  M.,  Wichern,  G.,  &  Simm,  J.  Improving  the  accuracy 
of  least-squares  probabilistic  classifiers.  IEICE  Transactions  on  Information  and 
Systems,  vol.E94-D,  no. 6,  pp. 1337-1340,  2011. 

13.  *  Sugiyama, M., Suzuki, T., Itoh, Y., Kanamori, T., & Kimura, M. Least-squares
two-sample test. Neural Networks, vol.24, no. 7, pp. 735-751, 2011.

14.  Kimura, M. & Sugiyama, M. Dependence-maximization clustering with least-squares
mutual information. Journal of Advanced Computational Intelligence and
Intelligent Informatics, vol.15, no. 7, pp. 800-805, 2011.

15.  Hachiya,  H.,  Peters,  J.,  &  Sugiyama,  M.  Reward  weighted  regression  with  sample 
reuse  for  direct  policy  search  in  reinforcement  learning.  Neural  Computation,  vol.23, 
no. 11,  pp. 2798-2832,  2011. 

16.  Hachiya, H., Sugiyama, M., & Ueda, N. Importance-weighted least-squares
probabilistic classifier for covariate shift adaptation with application to human
activity recognition. Neurocomputing, vol.80, pp. 93-101, 2012.

17.  Kanamori,  T.,  Suzuki,  T.,  &  Sugiyama,  M.  F-divergence  estimation  and  two-sample 
homogeneity  test  under  semiparametric  density-ratio  models.  IEEE  Transactions  on 
Information  Theory,  vol.58,  no. 2,  pp. 708-720,  2012. 

18.  Kanamori, T., Suzuki, T., & Sugiyama, M. Statistical analysis of kernel-based
least-squares density-ratio estimation. Machine Learning, vol.86, no. 3, pp. 335-367, 2012.

19.  Kawahara, Y. & Sugiyama, M. Sequential change-point detection based on direct
density-ratio estimation. Statistical Analysis and Data Mining, vol.5, no. 2,
pp. 114-127, 2012.

20.  Karasuyama, M. & Sugiyama, M. Canonical dependency analysis based on
squared-loss mutual information. Neural Networks, vol.34, pp. 46-55, 2012.

21.  *  Sugiyama,  M.,  Suzuki,  T.,  &  Kanamori,  T.  Density  ratio  matching  under  the 
Bregman  divergence:  A  unified  framework  of  density  ratio  estimation.  Annals  of 
the  Institute  of  Statistical  Mathematics,  vol.64,  no. 5,  pp. 1009-1044,  2012. 

22.  Sugiyama, M. & Yamada, M. On kernel parameter selection in Hilbert-Schmidt
independence criterion. IEICE Transactions on Information and Systems, vol.E95-D,
no. 10, pp. 2564-2567, 2012.

23.  *  Sugiyama, M. Machine learning with squared-loss mutual information. Entropy,
vol.15, no. 1, pp. 80-112, 2013.

24.  Suzuki,  T.  &  Sugiyama,  M.  Sufficient  dimension  reduction  via  squared-loss  mutual 
information  estimation.  Neural  Computation,  vol.25,  no. 3,  pp. 725-758,  2013. 

25.  Kanamori, T., Suzuki, T., & Sugiyama, M. Computational complexity of
kernel-based density-ratio estimation: A condition number analysis. Machine Learning,
vol.90, no. 3, pp. 431-460, 2013.

26.  *  Yamada, M., Suzuki, T., Kanamori, T., Hachiya, H., & Sugiyama, M. Relative
density-ratio estimation for robust distribution comparison. Neural Computation,
vol.25, no. 5, pp. 1324-1370, 2013.

6  Interactions


I had a discussion with my program manager, Dr. Hiroshi Motoda, during the Third Asian
Conference on Machine Learning in Taiwan on Nov. 13-15, 2011, and received detailed
technical comments and suggestions for this project. On Nov. 20, 2012, at my office at
Tokyo Institute of Technology, I had a meeting with Dr. Hiroshi Motoda and Lt Col Brian
Sells and received comments and suggestions for further development.

Below  is  the  list  of  my  presentations  related  to  the  project. 

1.  Jul.  5,  2011:  NEC  Laboratories  America,  USA. 

2.  Jul.  21,  2011:  ATR  Computational  Neuroscience  Labs.,  Japan 

3.  Aug.  12,  2011:  Yahoo!  Research,  USA. 

4.  Aug.  23,  2011:  ERATO  Project  Meeting,  Japan. 

5.  Sep.  15,  2011:  Hokkaido  University,  Japan. 

6.  Oct.  21,  2011:  SICE  seminar,  Japan. 

7.  Nov.  16,  2011:  National  Cheng  Kung  University,  Taiwan.

8.  Nov.  17,  2011:  National  Taiwan  University,  Taiwan. 

9.  Nov.  22,  2011:  Symposium  on  Innovative  Algorithms  for  e-Science,  Japan.

10.  Dec.  10,  2011:  Empirical  Inference  Symposium,  Germany. 

11.  Dec.  20,  2011:  Toshiba  Corporation,  Japan. 

12.  Jan.  23,  2012:  Mines  ParisTech,  France. 

13.  Jan.  24,  2012:  Ecole  Normale  Superieure,  France. 

14.  Jan.  25,  2012:  INRIA  Lille,  France. 

15.  Jan.  27,  2012:  FIRST  Project  Meeting,  Japan. 

16.  Feb.  17,  2012:  IPAB  Seminar,  Japan. 

17.  Apr.  25,  2012:  Computational  Science  Simulation  Symposium,  Japan. 

18.  Jun.  11,  2012:  Keio  University,  Japan. 

19.  Aug.  6,  2012:  Workshop  on  Machine  Learning  and  Applications  to  Biology,  Japan. 

20.  Aug.  8,  2012:  Hokkaido  University,  Japan. 

21.  Sep.  7,  2012:  21st  Machine  Learning  Summer  School,  Japan.

22.  Sep.  25,  2012:  BBCI  Summer  School  2012,  Germany.

23.  Dec.  14,  2012:  PRESTO  Project  Meeting,  Japan. 

24.  Dec.  17,  2012:  Toshiba  Corporation,  Japan. 

25.  Feb.  20,  2013:  International  Winter  Workshop  on  Brain-Computer  Interface,  Korea. 

26.  Feb.  22,  2013:  Seoul  National  University,  Korea. 

27.  Feb.  26,  2013:  NTT  Communication  Science  Laboratories,  Japan. 

28.  Mar.  6,  2013:  Nagoya  Institute  of  Technology,  Japan. 

29.  Mar.  18,  2013:  Aalto  University,  Finland. 

30.  Mar.  20,  2013:  VALO  Research  and  Trading,  Finland. 

31.  Mar.  20,  2013:  University  of  Helsinki,  Finland. 

32.  Apr.  25,  2013:  Omron  Corporation,  Japan. 

33.  May  21,  2013:  JSAE-SICE  Symposium,  Japan. 

7  Inventions 

None. 

8  Honors  /  Award 

I  received  four  awards  related  to  the  current  project. 

1.  Jun. 19, 2012: IBISML Award Finalist, IEICE, Information-Based Induction Sciences
and Machine Learning Technical Group.

2.  Apr.  14,  2012:  Funai  Award,  Funai  Foundation  for  Information  Technology. 

3.  Dec.  16,  2011:  JNNS  Best  Paper  Award,  Japanese  Neural  Network  Society. 

4.  Jun.  2,  2011:  Nagao  Special  Researcher  Award,  Information  Processing  Society  of 
Japan. 

9  Archival  Documentation 

Selected  papers  (Publications  3,  4,  5,  9,  10,  13,  21,  23,  and  26)  are  attached  as  archival 
documentation.  All  the  publications  listed  in  Section  5  are  available  from 

“http://sugiyama-www.cs.titech.ac.jp/~sugi/publications.html”.


10  Software 


Implementation  of  various  machine  learning  algorithms  (mostly  in  MATLAB)  is  available 
from  my  web  page: 

“http://sugiyama-www.cs.titech.ac.jp/~sugi/software/index.html”.


In  J.  E.  Gentle,  W.  Härdle,  and  Y.  Mori  (Eds.),

Handbook  of  Computational  Statistics:  Concepts  and  Methods, 

2nd  edition.  Chapter  31,  pp. 927-952,  Springer,  Berlin,  Germany,  2012. 




Learning  under  Non-stationarity: 

Covariate  Shift  Adaptation  by  Importance  Weighting 

Masashi  Sugiyama 
Tokyo  Institute  of  Technology 
2-12-1  O-okayama,  Meguro-ku,  Tokyo  152-8552,  Japan. 
sugi@cs.titech.ac.jp  http://sugiyama-www.cs.titech.ac.jp/~sugi


Abstract 

The  goal  of  supervised  learning  is  to  estimate  an  underlying  input-output  function 
from  its  input-output  training  samples  so  that  output  values  for  unseen  test  input 
points  can  be  predicted.  A  common  assumption  in  supervised  learning  is  that  the 
training  input  points  follow  the  same  probability  distribution  as  the  test  input 
points. However, this assumption is not satisfied, for example, when extrapolating
outside the training region. The situation where the training and test input
points  follow  different  distributions  while  the  conditional  distribution  of  output 
values  given  input  points  is  unchanged  is  called  covariate  shift.  Since  almost  all 
existing  learning  methods  assume  that  the  training  and  test  samples  are  drawn  from 
the  same  distribution,  their  fundamental  theoretical  properties  such  as  consistency 
or  efficiency  no  longer  hold  under  covariate  shift.  In  this  chapter,  we  review  recently 
proposed  techniques  for  covariate  shift  adaptation. 


1  Introduction 

The goal of supervised learning is to infer an unknown input-output dependency from
training samples, by which output values for unseen test input points can be predicted.
When developing a method of supervised learning, it is commonly assumed that the input
points in the training set and the input points used for testing follow the same probability
distribution (Wahba, 1990; Bishop, 1995; Vapnik, 1998; Duda et al., 2001; Hastie et al.,
2001; Schölkopf & Smola, 2002). However, this common assumption is not fulfilled, for
example, when extrapolating outside the training region or when training input points
are designed by an active learning (a.k.a. experimental design) algorithm (Wiens, 2000;
Kanamori & Shimodaira, 2003; Sugiyama, 2006; Kanamori, 2007; Sugiyama & Nakajima,
2009). Situations where training and test input points follow different probability
distributions but the conditional distributions of output values given input points are
unchanged are called covariate shift (Shimodaira, 2000). In this chapter, we review recently
proposed techniques for alleviating the influence of covariate shift.

Under covariate shift, standard learning techniques such as maximum likelihood
estimation are biased. It was shown that the bias caused by covariate shift can be
asymptotically canceled by weighting the loss function according to the importance, i.e.,
the ratio of test and training input densities (Shimodaira, 2000; Zadrozny, 2004;
Sugiyama & Müller, 2005; Sugiyama et al., 2007; Quiñonero-Candela et al., 2009;
Sugiyama & Kawanabe, 2011). Similarly, standard model selection criteria such as
cross-validation (Stone, 1974; Wahba, 1990) or Akaike’s information criterion (Akaike, 1974)
lose their unbiasedness under covariate shift. It was shown that proper unbiasedness can
also be recovered by modifying the methods based on importance weighting (Shimodaira,
2000; Zadrozny, 2004; Sugiyama & Müller, 2005; Sugiyama et al., 2007).

As explained above, the importance weight plays a central role in covariate shift
adaptation. However, since the importance weight is unknown in practice, it should be
estimated from data. A naive approach to this task is to first use kernel density estimation
(Härdle et al., 2004) for obtaining estimators of the training and test input densities, and
then take the ratio of the estimated densities. However, division by estimated quantities
can magnify the estimation error, so directly estimating the importance weight in
a single-shot process would be preferable. Following this idea, various methods for
directly estimating the importance have been explored (Silverman, 1978; Cwik &
Mielniczuk, 1989; Qin, 1998; Cheng & Chu, 2004; Huang et al., 2007; Bickel et al., 2007;
Sugiyama et al., 2008; Kanamori et al., 2009a). These direct estimation approaches have
been demonstrated to be more accurate than the two-step density estimation approach.
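The naive two-step approach described above can be sketched as follows. This is a minimal illustration, not code from the chapter: the Gaussian kernel, the fixed bandwidth, the toy Gaussian input distributions, and the function names `gaussian_kde` and `naive_importance` are all assumptions made for the example.

```python
import math
import random


def gaussian_kde(x, samples, bandwidth):
    """Kernel density estimate at x using a Gaussian kernel (assumed choice)."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return total / norm


def naive_importance(x, samples_tr, samples_te, bandwidth):
    """Two-step estimate: estimate both densities, then take their ratio."""
    return gaussian_kde(x, samples_te, bandwidth) / gaussian_kde(x, samples_tr, bandwidth)


# Toy data (hypothetical): training inputs ~ N(0, 1), test inputs ~ N(0.5, 1).
rng = random.Random(0)
samples_tr = [rng.gauss(0.0, 1.0) for _ in range(2000)]
samples_te = [rng.gauss(0.5, 1.0) for _ in range(2000)]

# At x = 0.25 the two true densities are equal, so the true importance is 1.
w_mid = naive_importance(0.25, samples_tr, samples_te, bandwidth=0.3)
print(w_mid)
```

Where the estimated training density in the denominator is close to zero, small errors in it are amplified by the division, which is the weakness the text attributes to the two-step approach.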

Examples of successful real-world applications of covariate shift adaptation include
brain-computer interface (Sugiyama et al., 2007), robot control (Hachiya et al., 2009;
Akiyama et al., 2010; Hachiya et al., 2011), speaker identification (Yamada et al., 2010a),
age prediction from face images (Ueki et al., 2011), wafer alignment in semiconductor
exposure apparatus (Sugiyama & Nakajima, 2009), and natural language processing (Tsuboi
et al., 2009).

The  rest  of  this  chapter  is  organized  as  follows.  In  Section  2,  the  problem  of  supervised 
learning  under  covariate  shift  is  mathematically  formulated.  In  Section  3,  various  learning 
methods  under  covariate  shift  are  introduced.  In  Section  4,  the  issue  of  model  selection 
under  covariate  shift  is  addressed.  In  Section  5,  methods  of  importance  estimation  are 
reviewed.  Finally,  we  conclude  in  Section  6. 

A  more  extensive  description  of  covariate  shift  adaptation  techniques  is  available  in 
Sugiyama  and  Kawanabe  (2011). 


2  Formulation  of  Supervised  Learning  under  Covariate  Shift

In  this  section,  we  formulate  the  supervised  learning  problem  under  covariate  shift. 

Let us consider the supervised learning problem of estimating an unknown input-output
dependency from training samples. Let

$$\left\{ (x_i^{\mathrm{tr}}, y_i^{\mathrm{tr}}) \;\middle|\; x_i^{\mathrm{tr}} \in \mathcal{X} \subset \mathbb{R}^d,\ y_i^{\mathrm{tr}} \in \mathcal{Y} \subset \mathbb{R} \right\}_{i=1}^{n_{\mathrm{tr}}}$$

be the training samples. $x_i^{\mathrm{tr}}$ is a training input point drawn from a probability
distribution with density $p_{\mathrm{tr}}(x)$. $y_i^{\mathrm{tr}}$ is a training output value following a
conditional probability distribution with conditional density $p(y \mid x = x_i^{\mathrm{tr}})$.
$p(y \mid x)$ may be regarded as describing the sum of the true output $f(x)$ and noise $\epsilon$:

$$y = f(x) + \epsilon.$$

We assume that the noise $\epsilon$ has mean 0 and variance $\sigma^2$. This formulation is
summarized in Figure 1.

Figure 1: Framework of supervised learning.

Let $(x^{\mathrm{te}}, y^{\mathrm{te}})$ be a test sample, which is not given to the user in the training
phase, but will be provided in the test phase in the future. $x^{\mathrm{te}} \in \mathcal{X}$ is a test
input point following a probability distribution with density $p_{\mathrm{te}}(x)$, which is different
from $p_{\mathrm{tr}}(x)$. $y^{\mathrm{te}} \in \mathcal{Y}$ is a test output value following
$p(y \mid x = x^{\mathrm{te}})$, which is the same conditional density as in the training phase. The
goal of supervised learning is to obtain an approximation $\hat{f}(x)$ to the true function $f(x)$
for predicting the test output value $y^{\mathrm{te}}$. More formally, we would like to obtain the
approximation $\hat{f}(x)$ that minimizes the test error expected over all test samples (which
is called the generalization error):

$$G := \mathbb{E}_{x^{\mathrm{te}}} \mathbb{E}_{y^{\mathrm{te}}} \left[ \mathrm{loss}\big(\hat{f}(x^{\mathrm{te}}), y^{\mathrm{te}}\big) \right],$$

where $\mathbb{E}_{x^{\mathrm{te}}}$ denotes the expectation over $x^{\mathrm{te}}$ drawn from $p_{\mathrm{te}}(x)$
and $\mathbb{E}_{y^{\mathrm{te}}}$ denotes the expectation over $y^{\mathrm{te}}$ drawn from
$p(y \mid x = x^{\mathrm{te}})$. $\mathrm{loss}(\hat{y}, y)$ is the loss function which measures the
discrepancy between the true output value $y$ and its estimate $\hat{y}$. When the output domain
$\mathcal{Y}$ is continuous, the problem is called regression and the squared loss is often used:

$$\mathrm{loss}(\hat{y}, y) = (\hat{y} - y)^2.$$



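On the other hand, when $\mathcal{Y} = \{+1, -1\}$, the problem is called (binary) classification
and the 0/1-loss is a typical choice:

$$\mathrm{loss}(\hat{y}, y) = \begin{cases} 0 & \text{if } \mathrm{sgn}(\hat{y}) = y, \\ 1 & \text{otherwise}, \end{cases}$$

where $\mathrm{sgn}(\hat{y}) = +1$ if $\hat{y} > 0$ and $\mathrm{sgn}(\hat{y}) = -1$ if $\hat{y} < 0$. Note that the
0/1-loss corresponds to the misclassification rate.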

We use a parametric function $\hat{f}(x; \theta)$ for learning, where $\theta$ is a parameter. A
model $\hat{f}(x; \theta)$ is said to be correctly specified if there exists a parameter $\theta^*$
such that $\hat{f}(x; \theta^*) = f(x)$; otherwise the model is said to be misspecified. In practice,
the model used for learning would be misspecified to a greater or lesser extent since we do not
generally have enough prior knowledge for correctly specifying the model. Thus it is important
to consider misspecified models when developing machine learning algorithms.

In standard supervised learning theories (Wahba, 1990; Bishop, 1995; Vapnik, 1998;
Duda et al., 2001; Hastie et al., 2001; Schölkopf & Smola, 2002), the test input point
$x^{\mathrm{te}}$ is assumed to follow the same distribution as the training input point
$x^{\mathrm{tr}}$. On the other hand, in this chapter, we consider the situation called covariate
shift (Shimodaira, 2000), i.e., the training input point $x^{\mathrm{tr}}$ and the test input point
$x^{\mathrm{te}}$ have different distributions. Under covariate shift, most of the standard learning
techniques do not work well due to the differing distributions. Below, we review recently
developed techniques for mitigating the influence of covariate shift.
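To make the setting concrete, the following sketch draws a toy training set and a toy test set that share the same conditional distribution of outputs but have different input densities. Everything specific in it is an assumption for illustration only: the sinc-shaped `true_function`, the Gaussian input densities, the noise level, and the helper name `sample_dataset` do not come from the chapter.

```python
import math
import random


def true_function(x):
    """Hypothetical true function f(x); here sinc(x), chosen only for illustration."""
    return 1.0 if x == 0.0 else math.sin(math.pi * x) / (math.pi * x)


def sample_dataset(n, input_mean, input_std, noise_std, rng):
    """Draw n pairs (x, y) with x ~ N(input_mean, input_std^2) and y = f(x) + noise."""
    xs = [rng.gauss(input_mean, input_std) for _ in range(n)]
    ys = [true_function(x) + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ys


rng = random.Random(0)
# Same f and same noise model for both sets: p(y|x) is unchanged.
# Only the input density differs: this is the covariate shift setting.
x_tr, y_tr = sample_dataset(500, input_mean=1.0, input_std=0.5, noise_std=0.1, rng=rng)
x_te, y_te = sample_dataset(500, input_mean=2.0, input_std=0.25, noise_std=0.1, rng=rng)

print(sum(x_tr) / len(x_tr), sum(x_te) / len(x_te))
```

In this toy setup most test inputs lie to the right of most training inputs, so predicting test outputs requires a degree of extrapolation, one of the motivating cases mentioned in the text.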


3  Function  Learning  under  Covariate  Shift 

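A standard method to learn the parameter $\theta$ in the model $\hat{f}(x; \theta)$ would be
empirical risk minimization (ERM) (Vapnik, 1998; Schölkopf & Smola, 2002):

$$\hat{\theta}_{\mathrm{ERM}} := \mathop{\mathrm{argmin}}_{\theta} \left[ \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \mathrm{loss}\big(\hat{f}(x_i^{\mathrm{tr}}; \theta), y_i^{\mathrm{tr}}\big) \right].$$

If $p_{\mathrm{tr}}(x) = p_{\mathrm{te}}(x)$, $\hat{\theta}_{\mathrm{ERM}}$ converges to the optimal parameter
$\theta^*$ (Shimodaira, 2000):

$$\theta^* := \mathop{\mathrm{argmin}}_{\theta} \, [G].$$

However, under covariate shift where $p_{\mathrm{tr}}(x) \neq p_{\mathrm{te}}(x)$,
$\hat{\theta}_{\mathrm{ERM}}$ does not converge to $\theta^*$ if the model is misspecified (see
footnote 1).

In this section, we review various learning methods for covariate shift adaptation and
show their numerical examples.

Footnote 1: $\hat{\theta}_{\mathrm{ERM}}$ still converges to $\theta^*$ under covariate shift if the model
is correctly specified.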
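The reason ERM targets the wrong quantity under covariate shift can be written out in one step. The following is a sketch in the chapter's notation; the symbol $G_{\mathrm{tr}}(\theta)$ is introduced here only for illustration and does not appear in the original text.

$$\frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \mathrm{loss}\big(\hat{f}(x_i^{\mathrm{tr}}; \theta), y_i^{\mathrm{tr}}\big) \;\longrightarrow\; G_{\mathrm{tr}}(\theta) := \mathbb{E}_{x^{\mathrm{tr}}} \mathbb{E}_{y^{\mathrm{tr}}} \left[ \mathrm{loss}\big(\hat{f}(x^{\mathrm{tr}}; \theta), y^{\mathrm{tr}}\big) \right] \quad (n_{\mathrm{tr}} \to \infty),$$

$$G(\theta) = \mathbb{E}_{x^{\mathrm{te}}} \mathbb{E}_{y^{\mathrm{te}}} \left[ \mathrm{loss}\big(\hat{f}(x^{\mathrm{te}}; \theta), y^{\mathrm{te}}\big) \right].$$

By the law of large numbers, the ERM objective approaches the expected loss under the training input density, whereas the generalization error is an expectation under the test input density. When the two densities differ, the minimizers of $G_{\mathrm{tr}}$ and $G$ generally differ for a misspecified model. For a correctly specified model and a loss such as the squared loss, $\hat{f}(x; \theta^*) = f(x)$ minimizes the conditional expected loss at every input point, so the minimizer does not depend on which input density is used for averaging.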
3.1  Importance Weighting Techniques for Covariate Shift Adaptation

Here, we introduce various regression and classification techniques for covariate shift
adaptation.


3.1.1  Importance  Weighted  ERM 

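The inconsistency of ERM is due to the difference between training and test input
distributions. Importance sampling (Fishman, 1996) is a standard technique to compensate
for the difference of distributions. The following identity shows the essence of importance
sampling:

$$\mathbb{E}_{x^{\mathrm{te}}} \left[ g(x^{\mathrm{te}}) \right] = \int g(x)\, p_{\mathrm{te}}(x)\, dx = \int g(x)\, \frac{p_{\mathrm{te}}(x)}{p_{\mathrm{tr}}(x)}\, p_{\mathrm{tr}}(x)\, dx = \mathbb{E}_{x^{\mathrm{tr}}} \left[ g(x^{\mathrm{tr}})\, \frac{p_{\mathrm{te}}(x^{\mathrm{tr}})}{p_{\mathrm{tr}}(x^{\mathrm{tr}})} \right],$$

where $\mathbb{E}_{x^{\mathrm{tr}}}$ and $\mathbb{E}_{x^{\mathrm{te}}}$ denote the expectations over
$x^{\mathrm{tr}}$ and $x^{\mathrm{te}}$ drawn from $p_{\mathrm{tr}}(x)$ and $p_{\mathrm{te}}(x)$, respectively.
The quantity

$$\frac{p_{\mathrm{te}}(x)}{p_{\mathrm{tr}}(x)}$$

is called the importance. The above identity shows that the expectation of a function
$g(x)$ over $p_{\mathrm{te}}(x)$ can be computed by the importance-weighted expectation of $g(x)$
over $p_{\mathrm{tr}}(x)$. Thus the difference of distributions can be systematically adjusted by
importance weighting.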
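The identity above can be checked numerically with a small Monte Carlo experiment. The sketch below assumes that both densities are known Gaussians (in practice the importance must be estimated, as the chapter discusses later); the particular densities, the choice g(x) = x, and the helper names `gauss_pdf` and `importance_weighted_mean` are hypothetical.

```python
import math
import random


def gauss_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_weighted_mean(g, samples_tr, p_tr, p_te):
    """Estimate E_{x ~ p_te}[g(x)] from samples drawn from p_tr."""
    total = 0.0
    for x in samples_tr:
        total += g(x) * p_te(x) / p_tr(x)  # weight each sample by the importance
    return total / len(samples_tr)


# Toy densities, assumed known for this illustration.
def p_tr(x):
    return gauss_pdf(x, 0.0, 1.0)


def p_te(x):
    return gauss_pdf(x, 0.5, 1.0)


rng = random.Random(0)
samples_tr = [rng.gauss(0.0, 1.0) for _ in range(100000)]

# Plain average estimates the training mean (0.0); the importance-weighted
# average estimates the test mean (0.5) using training samples only.
plain_mean = sum(samples_tr) / len(samples_tr)
iw_mean = importance_weighted_mean(lambda x: x, samples_tr, p_tr, p_te)
print(plain_mean, iw_mean)
```

The weighted estimate uses no test samples at all; the information about the test distribution enters only through the importance.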

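Applying the above importance weighting technique to ERM, we obtain
importance-weighted ERM (IWERM):

$$\hat{\theta}_{\mathrm{IWERM}} := \mathop{\mathrm{argmin}}_{\theta} \left[ \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})}\, \mathrm{loss}\big(\hat{f}(x_i^{\mathrm{tr}}; \theta), y_i^{\mathrm{tr}}\big) \right].$$

$\hat{\theta}_{\mathrm{IWERM}}$ converges to $\theta^*$ under covariate shift, even if the model is
misspecified (Shimodaira, 2000). In practice, IWERM may be regularized, e.g., by slightly
flattening the importance weight and/or adding a penalty term as

$$\mathop{\mathrm{argmin}}_{\theta} \left[ \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\gamma} \mathrm{loss}\big(\hat{f}(x_i^{\mathrm{tr}}; \theta), y_i^{\mathrm{tr}}\big) + \lambda\, \theta^{\top} \theta \right],$$

where $0 < \gamma < 1$ is the flattening parameter, $\lambda > 0$ is the regularization parameter,
and $\top$ denotes the transpose of a matrix or a vector.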
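The regularized IWERM objective can be sketched as a plain function of the parameter. The code below is an illustration under assumptions: the importance values are taken as given numbers, and the names `iwerm_objective`, `linear_model`, and `squared_loss` as well as the tiny data set are hypothetical. The values gamma = 0 and gamma = 1 are used here only as limiting cases, since they reduce the objective to plain ERM and to unflattened IWERM, respectively.

```python
def iwerm_objective(theta, xs, ys, weights, model, loss, gamma=1.0, lam=0.0):
    """Flattened, penalized importance-weighted empirical risk.

    weights[i] stands for the importance p_te(x_i) / p_tr(x_i), assumed given.
    """
    risk = 0.0
    for x, y, w in zip(xs, ys, weights):
        risk += (w ** gamma) * loss(model(x, theta), y)
    penalty = lam * sum(t * t for t in theta)  # lambda * theta^T theta
    return risk / len(xs) + penalty


def linear_model(x, theta):
    """Hypothetical model: theta[0] + theta[1] * x."""
    return theta[0] + theta[1] * x


def squared_loss(y_hat, y):
    return (y_hat - y) ** 2


# Tiny hand-made example.
xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0, 4.0]
weights = [0.5, 1.0, 2.0]
theta = [0.0, 1.0]

erm_value = iwerm_objective(theta, xs, ys, weights, linear_model, squared_loss, gamma=0.0)
iwerm_value = iwerm_objective(theta, xs, ys, weights, linear_model, squared_loss, gamma=1.0)
print(erm_value, iwerm_value)
```

Minimizing this function over theta with any general-purpose optimizer would give the corresponding estimator; intermediate values of gamma trade the consistency of IWERM against the lower variance of ERM.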


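3.1.2  Importance-Weighted Regression Methods

Least-squares (LS) would be one of the most fundamental regression techniques. The
importance-weighted regression method for the squared loss (see Figure 2), called
importance-weighted LS (IWLS), is given as follows:

$$\hat{\theta}_{\mathrm{IWLS}} := \mathop{\mathrm{argmin}}_{\theta} \left[ \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\gamma} \big( \hat{f}(x_i^{\mathrm{tr}}; \theta) - y_i^{\mathrm{tr}} \big)^2 + \lambda\, \theta^{\top} \theta \right]. \qquad (1)$$

Let us employ the following linear model:

$$\hat{f}(x; \theta) = \sum_{\ell=1}^{b} \theta_{\ell}\, \phi_{\ell}(x), \qquad (2)$$

where $\{\phi_{\ell}(x)\}_{\ell=1}^{b}$ are fixed linearly-independent basis functions. Then the solution
$\hat{\theta}_{\mathrm{IWLS}}$ is given analytically as

$$\hat{\theta}_{\mathrm{IWLS}} = \big( X^{\mathrm{tr}\top} W^{\gamma} X^{\mathrm{tr}} + n_{\mathrm{tr}} \lambda I_b \big)^{-1} X^{\mathrm{tr}\top} W^{\gamma} y^{\mathrm{tr}}, \qquad (3)$$

where $X^{\mathrm{tr}}$ is the design matrix, i.e., $X^{\mathrm{tr}}$ is the $n_{\mathrm{tr}} \times b$ matrix with the
$(i, \ell)$-th element $X^{\mathrm{tr}}_{i,\ell} = \phi_{\ell}(x_i^{\mathrm{tr}})$, $W$ is the diagonal matrix with the
$i$-th diagonal element $W_{i,i} = p_{\mathrm{te}}(x_i^{\mathrm{tr}}) / p_{\mathrm{tr}}(x_i^{\mathrm{tr}})$, $I_b$ is the
$b$-dimensional identity matrix, and $y^{\mathrm{tr}}$ is the $n_{\mathrm{tr}}$-dimensional vector with the
$i$-th element $y_i^{\mathrm{tr}}$.

Figure 2: Loss functions for regression. $y$ is the true output value at an input point and
$\hat{y}$ is its estimate.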
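The closed-form IWLS solution in Eq. (3) can be sketched without external libraries. The code below is an illustrative implementation under assumptions: importance values are supplied as numbers, the basis consists of a constant and a linear function, and the helper names `solve_linear_system` and `iwls_fit` are hypothetical. A numerical linear algebra library would normally replace the hand-written solver.

```python
def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - s) / aug[r][r]
    return z


def iwls_fit(xs, ys, weights, basis, gamma=1.0, lam=0.0):
    """Compute (X^T W^gamma X + n * lam * I)^{-1} X^T W^gamma y."""
    n, b = len(xs), len(basis)
    design = [[phi(x) for phi in basis] for x in xs]   # n x b design matrix
    wg = [w ** gamma for w in weights]                 # flattened importance
    lhs = [[sum(wg[i] * design[i][j] * design[i][k] for i in range(n))
            + (n * lam if j == k else 0.0)
            for k in range(b)] for j in range(b)]
    rhs = [sum(wg[i] * design[i][j] * ys[i] for i in range(n)) for j in range(b)]
    return solve_linear_system(lhs, rhs)


# Hypothetical basis functions: phi_1(x) = 1, phi_2(x) = x.
basis = [lambda x: 1.0, lambda x: x]
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # noise-free outputs of 1 + 2x
weights = [0.2, 0.5, 1.5, 3.0]     # assumed importance values

theta = iwls_fit(xs, ys, weights, basis, gamma=0.5, lam=0.0)
print(theta)
```

Because the toy outputs lie exactly on a line that the model can represent, the weights do not change the solution when lam is zero; this mirrors the remark that weighting matters only for misspecified models.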

The LS method often suffers from excessive sensitivity to outliers (i.e., irregular values)
and less reliability. A popular alternative is importance-weighted least absolute (IWLA)
regression, in which the absolute loss is used instead of the squared loss (see Figure 2):

$$\hat{\theta}_{\mathrm{IWLA}} = \mathop{\mathrm{argmin}}_{\theta} \left[ \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\gamma} \big| \hat{f}(x_i^{\mathrm{tr}}; \theta) - y_i^{\mathrm{tr}} \big| + \lambda\, \theta^{\top} \theta \right].$$

For the linear model (2), the above optimization problem is reduced to a quadratic
program, which can be solved by standard optimization software. If the regularization term
$\theta^{\top} \theta$ is replaced by the $\ell_1$-regularizer $\sum_{\ell=1}^{b} |\theta_{\ell}|$, the optimization
problem is reduced to a linear program, which may be solved more efficiently. Furthermore,
the $\ell_1$-regularizer is known to induce a sparse solution (Williams, 1995; Tibshirani, 1996;
Chen et al., 1998).

Although  the  LA  method  is  robust  against  outliers,  it  tends  to  have  a  large  variance 
when  the  noise  is  Gaussian.  The  use  of  the  Huber  loss  can  mitigate  this  problem: 


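\[
\widehat{\theta}_{\mathrm{Huber}} = \operatorname*{argmin}_{\theta} \left[ \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\gamma} \rho_\tau\bigl( \widehat{f}(x_i^{\mathrm{tr}};\theta) - y_i^{\mathrm{tr}} \bigr) + \lambda \theta^\top \theta \right],
\]

where $\tau$ ($\ge 0$) is the robustness parameter and $\rho_\tau$ is the Huber loss defined as follows (see Figure 2):

\[
\rho_\tau(y) := \begin{cases} \frac{1}{2} y^2 & \text{if } |y| \le \tau, \\ \tau |y| - \frac{1}{2} \tau^2 & \text{if } |y| > \tau. \end{cases}
\]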

Thus, the squared loss is applied to 'good' samples with small fitting error, and the absolute loss is applied to 'bad' samples with large fitting error. The above optimization problem can be reduced to a quadratic program (Mangasarian & Musicant, 2000), which can be solved by standard optimization software.
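The piecewise definition of the Huber loss can be checked numerically with the short sketch below. The function names are hypothetical, and the weighted risk omits the regularization term; it only illustrates how the flattened importance enters the data-fitting part of the objective.

```python
def huber_loss(residual, tau):
    """Huber loss rho_tau: quadratic for |r| <= tau, linear beyond tau."""
    r = abs(residual)
    if r <= tau:
        return 0.5 * r * r
    return tau * r - 0.5 * tau * tau


def weighted_huber_risk(residuals, weights, tau, gamma=1.0):
    """Importance-weighted average Huber loss (regularizer not included)."""
    total = sum((w ** gamma) * huber_loss(r, tau)
                for r, w in zip(residuals, weights))
    return total / len(residuals)
```

The two branches agree at $|y| = \tau$ (both give $\tau^2/2$), so the loss is continuous; a large residual contributes only linearly, which is what limits the influence of outliers.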

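Another variant of IWLA is importance-weighted support vector regression (IWSVR):

\[
\widehat{\theta}_{\mathrm{SVR}} = \operatorname*{argmin}_{\theta} \left[ \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\gamma} \bigl| \widehat{f}(x_i^{\mathrm{tr}};\theta) - y_i^{\mathrm{tr}} \bigr|_{\epsilon} + \lambda \theta^\top \theta \right],
\]

where $|\cdot|_{\epsilon}$ is the deadzone-linear loss (or Vapnik's $\epsilon$-insensitive loss) defined as follows (see Figure 2):

\[
|x|_{\epsilon} := \begin{cases} 0 & \text{if } |x| \le \epsilon, \\ |x| - \epsilon & \text{if } |x| > \epsilon. \end{cases}
\]

For the linear model (2), the above optimization problem is reduced to a quadratic program (Vapnik, 1998), which can be solved by standard optimization software. If the regularization term $\theta^\top\theta$ is replaced by the $\ell_1$-regularizer $\sum_{\ell=1}^{b} |\theta_\ell|$, the optimization problem is reduced to a linear program.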
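As a small illustration of the deadzone-linear loss, the following sketch (with the hypothetical name `eps_insensitive_loss`) shows that residuals inside the band of half-width $\epsilon$ cost nothing, while larger residuals are penalized linearly.

```python
def eps_insensitive_loss(residual, eps):
    """Deadzone-linear loss |x|_eps: zero inside [-eps, eps], linear outside."""
    return max(0.0, abs(residual) - eps)


# Residuals 0.05 and -0.3 with eps = 0.1: the first is ignored, the second is not.
small = eps_insensitive_loss(0.05, 0.1)
large = eps_insensitive_loss(-0.3, 0.1)
```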


3.1.3  Importance-Weighted  Classification  Methods


In the binary classification scenario where $y \in \{+1, -1\}$, Fisher discriminant analysis (FDA) (Fisher, 1936), logistic regression (LR) (Hastie et al., 2001), support vector machine (SVM) (Vapnik, 1998; Schölkopf & Smola, 2002), and boosting (Freund & Schapire, 1996; Breiman, 1998; Friedman et al., 2000) are popular learning algorithms. They can be regarded as ERM methods with different loss functions (see Figure 3).

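An importance-weighted version of FDA, IWFDA, is given by

\[
\widehat{\theta}_{\mathrm{IWFDA}} := \operatorname*{argmin}_{\theta} \left[ \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\gamma} \bigl( 1 - y_i^{\mathrm{tr}} \widehat{f}(x_i^{\mathrm{tr}};\theta) \bigr)^2 + \lambda \theta^\top \theta \right],
\]

which is essentially equivalent to Eq. (1): since $(y_i^{\mathrm{tr}})^2 = 1$, we have $(1 - y_i^{\mathrm{tr}} \widehat{f})^2 = (y_i^{\mathrm{tr}})^2 (y_i^{\mathrm{tr}} - \widehat{f})^2 = (\widehat{f} - y_i^{\mathrm{tr}})^2$.

Learning under Non-stationarity

Figure 3: Loss functions for classification. $y$ is the true output value at an input point and $\widehat{y}$ is its estimate.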

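An importance-weighted version of LR, IWLR, is given by

\[
\widehat{\theta}_{\mathrm{IWLR}} := \operatorname*{argmin}_{\theta} \left[ \sum_{i=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\gamma} \log\Bigl( 1 + \exp\bigl( -y_i^{\mathrm{tr}} \widehat{f}(x_i^{\mathrm{tr}};\theta) \bigr) \Bigr) + \lambda \theta^\top \theta \right],
\]

which is usually solved by (quasi-)Newton methods.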

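An importance-weighted version of SVM, IWSVM, is given by

\[
\widehat{\theta}_{\mathrm{IWSVM}} := \operatorname*{argmin}_{\theta} \left[ \sum_{i=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\gamma} \max\bigl( 0,\, 1 - y_i^{\mathrm{tr}} \widehat{f}(x_i^{\mathrm{tr}};\theta) \bigr) + \lambda \theta^\top \theta \right],
\]

whose solution can be obtained by a standard quadratic programming solver.

An importance-weighted version of boosting, IWB, is given by

\[
\widehat{\theta}_{\mathrm{IWB}} := \operatorname*{argmin}_{\theta} \left[ \sum_{i=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\gamma} \exp\bigl( -y_i^{\mathrm{tr}} \widehat{f}(x_i^{\mathrm{tr}};\theta) \bigr) + \lambda \theta^\top \theta \right],
\]

which is usually solved by stage-wise optimization.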
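The four classification methods above differ only in the loss applied to the margin $m = y \widehat{f}(x;\theta)$. The sketch below writes these losses as plain functions and evaluates an importance-weighted sum of losses. The function names are hypothetical, and the regularizer and any $1/n_{\mathrm{tr}}$ normalization are left out; it is meant only to make the shared structure of the objectives concrete.

```python
import math


def squared_loss(m):
    """FDA: (1 - m)^2."""
    return (1.0 - m) ** 2


def logistic_loss(m):
    """LR: log(1 + exp(-m)), written to avoid overflow for large |m|."""
    return max(-m, 0.0) + math.log1p(math.exp(-abs(m)))


def hinge_loss(m):
    """SVM: max(0, 1 - m)."""
    return max(0.0, 1.0 - m)


def exponential_loss(m):
    """Boosting: exp(-m)."""
    return math.exp(-m)


def importance_weighted_loss_sum(margins, weights, loss, gamma=1.0):
    """Sum of flattened-importance-weighted losses over the training set."""
    return sum((w ** gamma) * loss(m) for m, w in zip(margins, weights))


# Two samples with margins +1 and -1 and importance 4 and 1, flattened by 0.5.
risk = importance_weighted_loss_sum([1.0, -1.0], [4.0, 1.0], hinge_loss, gamma=0.5)
```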


3.2  Numerical  Examples 

Here  we  illustrate  the  behavior  of  IWERM  using  toy  regression  and  classification  data 
sets. 


3.2.1  Regression 

Let us consider a one-dimensional regression problem. Let the learning target function be $f(x) = \mathrm{sinc}(x)$, and let the training and test input densities be

\[
p_{\mathrm{tr}}(x) = N(x; 1, (1/2)^2) \quad \text{and} \quad p_{\mathrm{te}}(x) = N(x; 2, (1/4)^2),
\]

where $N(x; \mu, \sigma^2)$ denotes the Gaussian density with mean $\mu$ and variance $\sigma^2$. As illustrated in Figure 4(a), we are considering a (weak) extrapolation problem since the training input points are distributed on the left-hand side of the input domain and the test input points are distributed on the right-hand side.

Figure 4: An illustrative regression example with covariate shift. (a) The probability density functions of the training and test input points and their ratio (i.e., the importance). (b)-(d) The learning target function $f(x)$ (the solid line), training samples ('o'), a learned function $\widehat{f}(x)$ (the dashed line), and test samples ('x'). Panels (b), (c), and (d) correspond to $\gamma = 0$, $\gamma = 0.5$, and $\gamma = 1$, respectively.
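For this toy setting the importance is available in closed form, so it can be evaluated directly. The sketch below (function names are hypothetical) computes the ratio of the two Gaussian densities and its flattened version $w(x)^{\gamma}$; at the test mean $x = 2$ the ratio equals $2e^{2}$, which shows how strongly the few training points in the test region are up-weighted when $\gamma = 1$.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of N(x; mean, std^2)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance(x):
    """w(x) = p_te(x) / p_tr(x) for the toy regression densities."""
    return gaussian_pdf(x, 2.0, 0.25) / gaussian_pdf(x, 1.0, 0.5)


def flattened_importance(x, gamma):
    """w(x)^gamma: gamma = 0 gives uniform weights, gamma = 1 the full importance."""
    return importance(x) ** gamma
```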

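We create the training output values $\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ as $y_i^{\mathrm{tr}} = f(x_i^{\mathrm{tr}}) + \epsilon_i^{\mathrm{tr}}$, where $\{\epsilon_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ are i.i.d. noise drawn from $N(\epsilon; 0, (1/4)^2)$. Let the number of training samples be $n_{\mathrm{tr}} = 150$, and we use the following linear model:

\[
\widehat{f}(x;\theta) = \theta_1 x + \theta_2.
\]

The parameter $\theta$ is learned by IWLS.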

Here we fix the regularization parameter to $\lambda = 0$, and compare the performance of IWLS for $\gamma = 0, 0.5, 1$. When $\gamma = 0$, a good approximation of the left-hand side of the sinc function can be obtained (see Figure 4(b)). However, this is not appropriate for estimating the test output values ('x' in the figure). Thus, IWLS with $\gamma = 0$ (i.e., ordinary LS) results in a large test error. Figure 4(d) depicts the learned function when $\gamma = 1$, which tends to approximate the test output values well, but with a large variance. Figure 4(c) depicts a learned function when $\gamma = 0.5$, which yields even better estimation of the test output values for this particular data realization.

Figure 5: An illustrative classification example with covariate shift. (a) Optimal decision boundary (the thick solid line) and contours of training and test input densities (thin solid lines). (b) Optimal decision boundary (solid line) and learned boundaries (dashed lines). 'o' and 'x' denote the positive and negative training samples, while '□' and '+' denote the positive and negative test samples.


3.2.2  Classification 


Let us consider a binary classification problem on the two-dimensional input space. Let the class posterior probabilities given input $x$ be

\[
p(y = +1 \mid x) = \frac{1}{2} \Bigl( 1 + \tanh\bigl( x^{(1)} + \min(0, x^{(2)}) \bigr) \Bigr), \tag{4}
\]

where $x = (x^{(1)}, x^{(2)})^\top$ and $p(y = -1 \mid x) = 1 - p(y = +1 \mid x)$. The optimal decision boundary, i.e., the set of all $x$ such that $p(y = +1 \mid x) = p(y = -1 \mid x) = 1/2$, is illustrated in Figure 5(a).
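The class posterior of Eq. (4) is easy to evaluate and to sample from. In the sketch below the function names are hypothetical. Since $\tanh$ vanishes only at zero, the optimal decision boundary is the set where $x^{(1)} + \min(0, x^{(2)}) = 0$: the vertical line $x^{(1)} = 0$ for $x^{(2)} \ge 0$ and the line $x^{(1)} = -x^{(2)}$ for $x^{(2)} < 0$.

```python
import math
import random


def posterior_positive(x1, x2):
    """p(y = +1 | x) of Eq. (4) for x = (x1, x2)."""
    return 0.5 * (1.0 + math.tanh(x1 + min(0.0, x2)))


def sample_label(x1, x2, rng):
    """Draw a label in {+1, -1} from the class posterior."""
    return 1 if rng.random() < posterior_positive(x1, x2) else -1


def optimal_label(x1, x2):
    """Bayes-optimal prediction: +1 where the posterior exceeds 1/2."""
    return 1 if x1 + min(0.0, x2) > 0.0 else -1


rng = random.Random(0)
labels = [sample_label(1.0, 0.0, rng) for _ in range(5)]
```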

Let the training and test input densities be
where $N(x; \mu, \Sigma)$ is the multivariate Gaussian density with mean $\mu$ and covariance matrix $\Sigma$. This setup implies that we are considering a (weak) extrapolation problem. Contours of the training and test input densities are illustrated in Figure 5(a).

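Let the number of training samples be $n_{\mathrm{tr}} = 500$, and we create training input points $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ following $p_{\mathrm{tr}}(x)$ and training output labels $\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ following $p(y \mid x = x_i^{\mathrm{tr}})$. Similarly, let the number of test samples be $n_{\mathrm{te}} = 500$, and we create $n_{\mathrm{te}}$ test input points $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$ following $p_{\mathrm{te}}(x)$ and test output labels $\{y_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$ following $p(y \mid x = x_j^{\mathrm{te}})$. We use the following linear model:

\[
\widehat{f}(x;\theta) = \theta_1 x^{(1)} + \theta_2 x^{(2)} + \theta_3.
\]

The parameter $\theta$ is learned by IWFDA.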

Here we fix the regularization parameter to $\lambda = 0$, and compare the performance of IWFDA for $\gamma = 0, 0.5, 1$. Figure 5(b) depicts an example of realizations of training and test samples, and decision boundaries obtained by IWFDA. For this particular realization of data samples, $\gamma = 0.5$ or $1$ works better than $\gamma = 0$.

4  Model  Selection  under  Covariate  Shift 

As illustrated in the previous section, importance weighting is a promising approach to covariate shift adaptation, given that the flattening parameter $\gamma$ is chosen appropriately. Although $\gamma = 0.5$ worked well for both the toy regression and classification experiments in the previous section, $\gamma = 0.5$ may not always be the best choice. Indeed, an appropriate value of $\gamma$ depends on the learning target function, the model, the noise level in the training samples, etc. Therefore, model selection needs to be carried out appropriately to enhance the generalization capability under covariate shift.

The goal of model selection is to determine the model (e.g., basis functions, the flattening parameter $\gamma$, and the regularization parameter $\lambda$) so that the generalization error is minimized (Akaike, 1970; Mallows, 1973; Akaike, 1974; Takeuchi, 1976; Schwarz, 1978; Rissanen, 1978; Craven & Wahba, 1979; Akaike, 1980; Rissanen, 1987; Shibata, 1989; Wahba, 1990; Efron & Tibshirani, 1993; Murata et al., 1994; Konishi & Kitagawa, 1996; Ishiguro et al., 1997; Vapnik, 1998; Sugiyama & Ogawa, 2001; Sugiyama & Müller, 2002; Sugiyama et al., 2004). The true generalization error is not accessible since it contains the unknown learning target function. Thus, some generalization error estimator needs to be used instead. However, standard generalization error estimators such as cross-validation (CV) are heavily biased under covariate shift, and therefore they are no longer reliable. In this section, we review a modified CV method that possesses proper unbiasedness even under covariate shift.

4.1  Importance-Weighted  Cross-Validation

One of the popular techniques for estimating the generalization error is CV (Stone, 1974; Wahba, 1990). CV has been shown to give an almost unbiased estimate of the generalization error with finite samples (Luntz & Brailovsky, 1969; Schölkopf & Smola, 2002). However, such almost unbiasedness is no longer fulfilled under covariate shift.

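To cope with this problem, a variant of CV called importance-weighted CV (IWCV) has been proposed (Sugiyama et al., 2007). Let us randomly divide the training set $Z = \{(x_i^{\mathrm{tr}}, y_i^{\mathrm{tr}})\}_{i=1}^{n_{\mathrm{tr}}}$ into $k$ disjoint non-empty subsets $\{Z_i\}_{i=1}^{k}$ of (approximately) the same size. Let $\widehat{f}_{Z_i}(x)$ be a function learned from $\{Z_{i'}\}_{i' \ne i}$ (i.e., without $Z_i$). Then the $k$-fold IWCV ($k$IWCV) estimate of the generalization error $G$ is given by

\[
\widehat{G}_{k\mathrm{IWCV}} = \frac{1}{k} \sum_{i=1}^{k} \frac{1}{|Z_i|} \sum_{(x,y) \in Z_i} \frac{p_{\mathrm{te}}(x)}{p_{\mathrm{tr}}(x)} \, \mathrm{loss}\bigl( \widehat{f}_{Z_i}(x), y \bigr),
\]

where $|Z_i|$ is the number of samples in the subset $Z_i$.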
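The $k$-fold IWCV estimate can be sketched as a generic routine that is agnostic to the learner and the loss. The names `iwcv_score`, `fit`, and `loss` are hypothetical, and the importance values are assumed to be supplied by the caller (in practice they would come from an importance estimator such as those in Section 5). The only difference from ordinary CV is the weight applied to each held-out loss.

```python
import random


def iwcv_score(xs, ys, weights, fit, loss, k=10, seed=0):
    """k-fold importance-weighted CV estimate of the generalization error.

    fit(xs, ys, weights) must return a callable predictor; weights[i] is the
    importance p_te(x_i) / p_tr(x_i) of the i-th training sample.
    Requires k <= len(xs) so that every fold is non-empty.
    """
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[r::k] for r in range(k)]  # k disjoint, near-equal subsets
    total = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        predictor = fit([xs[i] for i in train],
                        [ys[i] for i in train],
                        [weights[i] for i in train])
        # Importance-weighted average loss on the held-out subset.
        total += sum(weights[i] * loss(predictor(xs[i]), ys[i])
                     for i in held_out) / len(held_out)
    return total / k
```

Passing `weights = [1.0] * n` recovers ordinary $k$-fold CV, and `k = len(xs)` gives the leave-one-out variant.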

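When $k = n_{\mathrm{tr}}$, $k$IWCV is particularly called IW leave-one-out CV (IWLOOCV):

\[
\widehat{G}_{\mathrm{IWLOOCV}} = \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \, \mathrm{loss}\bigl( \widehat{f}_i(x_i^{\mathrm{tr}}), y_i^{\mathrm{tr}} \bigr),
\]

where $\widehat{f}_i(x)$ is a function learned from $\{(x_{i'}^{\mathrm{tr}}, y_{i'}^{\mathrm{tr}})\}_{i' \ne i}$ (i.e., without $(x_i^{\mathrm{tr}}, y_i^{\mathrm{tr}})$). It was proved that IWLOOCV gives an almost unbiased estimate of the generalization error even under covariate shift (Sugiyama et al., 2007). More precisely, IWLOOCV for $n_{\mathrm{tr}}$ training samples gives an unbiased estimate of the generalization error for $n_{\mathrm{tr}} - 1$ training samples:

\[
\mathbb{E}_{\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} \mathbb{E}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} \bigl[ \widehat{G}_{\mathrm{IWLOOCV}} \bigr]
= \mathbb{E}_{\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}-1}} \mathbb{E}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}-1}} [G']
\approx \mathbb{E}_{\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} \mathbb{E}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} [G],
\]

where $\mathbb{E}_{\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}}$ denotes the expectation over $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ drawn i.i.d. from $p_{\mathrm{tr}}(x)$, $\mathbb{E}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}}$ denotes the expectation over $\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ each drawn from $p(y \mid x = x_i^{\mathrm{tr}})$, and $G'$ denotes the generalization error for $n_{\mathrm{tr}} - 1$ training samples. A similar proof is also possible for $k$IWCV, but the bias is slightly larger (Hastie et al., 2001).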

The almost unbiasedness of IWCV holds for any loss function, any model, and any parameter learning method; even non-identifiable models (Watanabe, 2009) or non-parametric learning methods (Schölkopf & Smola, 2002) are allowed. Thus IWCV is a highly flexible model selection technique under covariate shift. For other model selection criteria under covariate shift, see Shimodaira (2000) for regular models with smooth losses and Sugiyama and Müller (2005) for linear models with the squared loss.


4.2  Numerical  Examples 

Here  we  illustrate  the  behavior  of  IWCV  using  the  same  toy  data  sets  as  Section  3.2. 

4.2.1  Regression 

Let us continue the one-dimensional regression simulation in Section 3.2.1.

As illustrated in Figure 4 in Section 3.2.1, IWLS with flattening parameter $\gamma = 0.5$ appears to work well for that particular realization of data samples. However, the best value of $\gamma$ would depend on the realization of samples. In order to investigate this systematically, let us repeat the simulation 1000 times with different random seeds, i.e., in each run $\{(x_i^{\mathrm{tr}}, \epsilon_i^{\mathrm{tr}})\}_{i=1}^{n_{\mathrm{tr}}}$ are randomly drawn and the scores of 10-fold IWCV and 10-fold ordinary CV are calculated for $\gamma = 0, 0.1, 0.2, \ldots, 1$. The means and standard deviations of the generalization error $G$ and its estimate by each method are depicted as functions of $\gamma$ in Figure 6. The graphs show that IWCV gives very accurate unbiased estimates of the generalization error, while ordinary CV is heavily biased.

Next we investigate the model selection performance. The flattening parameter $\gamma$ is chosen from $\{0, 0.1, 0.2, \ldots, 1\}$ so that the score of each model selection criterion is minimized. The mean and standard deviation of the generalization error $G$ of the learned function obtained by each method over 1000 runs are described in Table 1. This shows that IWCV gives significantly smaller generalization errors than ordinary CV, under the t-test (Henkel, 1976) at the significance level 5%. For reference, the generalization error when the flattening parameter $\gamma$ is chosen optimally (i.e., in each trial, $\gamma$ is chosen so that the true generalization error is minimized) is described as 'Optimal' in the table. The result shows that the performance of IWCV is rather close to that of the optimal choice.

4.2.2  Classification 

Let  us  continue  the  toy  classification  simulation  in  Section  3.2.2. 

In Figure 5(b) in Section 3.2.2, IWFDA with a middle/large flattening parameter $\gamma$ appears to work well for that particular realization of samples. Here, we investigate the choice of the flattening parameter value by IWCV and ordinary CV. Figure 7 depicts the means and standard deviations of the generalization error $G$ (i.e., the misclassification rate) and its estimate by each method over 1000 runs, as functions of the flattening parameter $\gamma$ in IWFDA. The graphs clearly show that IWCV gives much better estimates of the generalization error than ordinary CV.

Next we investigate the model selection performance. The flattening parameter $\gamma$ is chosen from $\{0, 0.1, 0.2, \ldots, 1\}$ so that the score of each model selection criterion is minimized. The mean and standard deviation of the generalization error $G$ of the learned function obtained by each method over 1000 runs are described in Table 2. The table shows that IWCV gives significantly smaller test errors than ordinary CV, and the performance of IWCV is rather close to that of the optimal choice.


Figure 6: Generalization error and its estimates obtained by IWCV and ordinary CV as functions of the flattening parameter $\gamma$ in IWLS for the regression examples in Figure 4. Panel (a) shows the true generalization error. Thick dashed curves in the bottom graphs depict the true generalization error for clear comparison.

Table 1: The mean and standard deviation of the generalization error $G$ obtained by each method for the toy regression data set. The best method and comparable ones by the t-test at the significance level 5% are indicated by 'o'. For reference, the generalization error obtained with the optimal $\gamma$ (i.e., the minimum generalization error) is described as 'Optimal'.

    IWCV             Ordinary CV      Optimal
    0.077 ± 0.020    0.356 ± 0.086    0.069 ± 0.011

Figure 7: The generalization error $G$ (i.e., the misclassification rate) and its estimates obtained by IWCV and ordinary CV as functions of the flattening parameter $\gamma$ in IWFDA for the toy classification examples in Figure 5. Panels: (a) true generalization error, (b) IWCV score, (c) ordinary CV score. Thick dashed curves in the bottom graphs depict the true generalization error for clear comparison.

Table 2: The mean and standard deviation of the generalization error $G$ (i.e., the misclassification rate) obtained by each method for the toy classification data set. The best method and comparable ones by the t-test at the significance level 5% are indicated by 'o'. For reference, the generalization error obtained with the optimal $\gamma$ (i.e., the minimum generalization error) is described as 'Optimal'.

    IWCV             Ordinary CV      Optimal
    0.108 ± 0.027    0.131 ± 0.029    0.091 ± 0.009


5  Importance  Estimation


In the previous sections, we have seen that the importance weight $w(x) = p_{\mathrm{te}}(x)/p_{\mathrm{tr}}(x)$ plays a central role in covariate shift adaptation. However, the importance weight is unknown in practice and needs to be estimated from data. In this section, we review importance estimation methods.

Here we assume that in addition to the training input samples $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ drawn independently from $p_{\mathrm{tr}}(x)$, we are given test input samples $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$ drawn independently from $p_{\mathrm{te}}(x)$. Thus the goal of the importance estimation problem addressed here is to estimate the importance function $w(x)$ from $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ and $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$.


5.1  Kernel  Density  Estimation 


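Kernel density estimation (KDE) is a non-parametric technique to estimate a probability density function $p(x)$ from its i.i.d. samples $\{x_i\}_{i=1}^{n}$. For the Gaussian kernel

\[
K_\sigma(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right), \tag{5}
\]

KDE is expressed as

\[
\widehat{p}(x) = \frac{1}{n (2\pi\sigma^2)^{d/2}} \sum_{i=1}^{n} K_\sigma(x, x_i),
\]

where $d$ is the dimensionality of $x$.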

The performance of KDE depends on the choice of the kernel width $\sigma$. It can be optimized by cross-validation (CV) as follows (Härdle et al., 2004): First, divide the samples $\{x_i\}_{i=1}^{n}$ into $k$ disjoint non-empty subsets $\{X_r\}_{r=1}^{k}$ of (approximately) the same size. Then obtain a density estimator $\widehat{p}_{X_r}(x)$ from $\{X_i\}_{i \ne r}$ (i.e., without $X_r$), and compute its log-likelihood for the hold-out subset $X_r$:

\[
\frac{1}{|X_r|} \sum_{x \in X_r} \log \widehat{p}_{X_r}(x),
\]

where $|X|$ denotes the number of elements in the set $X$. Repeat this procedure for $r = 1, 2, \ldots, k$ and choose the value of $\sigma$ such that the average of the above hold-out log-likelihood over all $r$ is maximized. Note that the average hold-out log-likelihood is an almost unbiased estimate of the negative Kullback-Leibler divergence from $p(x)$ to $\widehat{p}(x)$, up to an irrelevant constant, so maximizing it amounts to minimizing the divergence.

KDE can be used for importance estimation by first obtaining density estimators $\widehat{p}_{\mathrm{tr}}(x)$ and $\widehat{p}_{\mathrm{te}}(x)$ separately from $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ and $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$, and then estimating the importance by $\widehat{w}(x) = \widehat{p}_{\mathrm{te}}(x)/\widehat{p}_{\mathrm{tr}}(x)$. However, division by an estimated density can magnify the estimation error, so directly estimating the importance weight in a single-shot process would be preferable.
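The KDE-based two-step approach can be sketched for one-dimensional inputs as follows. All function names are hypothetical, the samples are assumed to be in random order already (so folds are formed by striding), and a tiny floor is applied before taking logarithms to avoid `log(0)`; none of these choices come from the original text.

```python
import math


def gaussian_kernel(x, c, sigma):
    """Gaussian kernel of Eq. (5) for scalar inputs."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))


def kde(x, samples, sigma):
    """One-dimensional Gaussian KDE (d = 1 in the normalizing constant)."""
    norm = len(samples) * math.sqrt(2.0 * math.pi * sigma ** 2)
    return sum(gaussian_kernel(x, c, sigma) for c in samples) / norm


def cv_log_likelihood(samples, sigma, k=5):
    """Average hold-out log-likelihood over k folds."""
    folds = [samples[r::k] for r in range(k)]
    score = 0.0
    for r, held_out in enumerate(folds):
        rest = [x for q, fold in enumerate(folds) if q != r for x in fold]
        score += sum(math.log(max(kde(x, rest, sigma), 1e-300))
                     for x in held_out) / len(held_out)
    return score / k


def select_width(samples, candidates, k=5):
    """Pick the kernel width with the largest hold-out log-likelihood."""
    return max(candidates, key=lambda s: cv_log_likelihood(samples, s, k))


def kde_importance(x, x_tr, x_te, sigma_tr, sigma_te):
    """Two-step estimate: ratio of separately estimated densities.

    Fails with ZeroDivisionError where the training density estimate
    underflows to zero, which illustrates how fragile the division can be.
    """
    return kde(x, x_te, sigma_te) / kde(x, x_tr, sigma_tr)
```

Where the estimated training density is close to zero, small errors in the denominator translate into large errors in the ratio, which is the motivation for the direct approach of the next section.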


5.2  Kullback-Leibler  Importance  Estimation  Procedure 

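The Kullback-Leibler importance estimation procedure (KLIEP) (Sugiyama et al., 2008) directly gives an estimate of the importance function without going through density estimation by matching the two densities $p_{\mathrm{tr}}(x)$ and $p_{\mathrm{te}}(x)$ in terms of the Kullback-Leibler divergence (Kullback & Leibler, 1951).

Let us model the importance weight $w(x)$ by the following kernel model:

\[
\widehat{w}(x) = \sum_{\ell=1}^{n_{\mathrm{te}}} \alpha_\ell K_\sigma(x, x_\ell^{\mathrm{te}}),
\]

where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_{n_{\mathrm{te}}})^\top$ are parameters to be learned from data samples and $K_\sigma(x, x')$ is the Gaussian kernel (see Eq.(5)). An estimate of the density $p_{\mathrm{te}}(x)$ is given by using the model $\widehat{w}(x)$ as $\widehat{p}_{\mathrm{te}}(x) = \widehat{w}(x) p_{\mathrm{tr}}(x)$. In KLIEP, the parameters $\alpha$ are determined so that the Kullback-Leibler divergence from $p_{\mathrm{te}}(x)$ to $\widehat{p}_{\mathrm{te}}(x)$ is minimized:

\[
\mathrm{KL}(\alpha) := \mathbb{E}_{x^{\mathrm{te}}} \left[ \log \frac{p_{\mathrm{te}}(x^{\mathrm{te}})}{\widehat{w}(x^{\mathrm{te}}) p_{\mathrm{tr}}(x^{\mathrm{te}})} \right]
= \mathbb{E}_{x^{\mathrm{te}}} \left[ \log \frac{p_{\mathrm{te}}(x^{\mathrm{te}})}{p_{\mathrm{tr}}(x^{\mathrm{te}})} \right] - \mathbb{E}_{x^{\mathrm{te}}} \bigl[ \log \widehat{w}(x^{\mathrm{te}}) \bigr],
\]

where $\mathbb{E}_{x^{\mathrm{te}}}$ denotes the expectation over $x^{\mathrm{te}}$ drawn from $p_{\mathrm{te}}(x)$. The first term is a constant, so it can be safely ignored. We define the negative of the second term by $\mathrm{KL}'$:

\[
\mathrm{KL}'(\alpha) := \mathbb{E}_{x^{\mathrm{te}}} \bigl[ \log \widehat{w}(x^{\mathrm{te}}) \bigr]. \tag{6}
\]

Since $\widehat{p}_{\mathrm{te}}(x)$ ($= \widehat{w}(x) p_{\mathrm{tr}}(x)$) is a probability density function, it should satisfy

\[
1 = \int \widehat{p}_{\mathrm{te}}(x) \, dx = \int \widehat{w}(x) p_{\mathrm{tr}}(x) \, dx = \mathbb{E}_{x^{\mathrm{tr}}} \bigl[ \widehat{w}(x^{\mathrm{tr}}) \bigr]. \tag{7}
\]

The KLIEP optimization problem is given by replacing the expectations in Eqs.(6) and (7) with empirical averages:

\[
\max_{\{\alpha_\ell\}_{\ell=1}^{n_{\mathrm{te}}}} \left[ \frac{1}{n_{\mathrm{te}}} \sum_{j=1}^{n_{\mathrm{te}}} \log \left( \sum_{\ell=1}^{n_{\mathrm{te}}} \alpha_\ell K_\sigma(x_j^{\mathrm{te}}, x_\ell^{\mathrm{te}}) \right) \right]
\]
\[
\text{subject to} \quad \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \sum_{\ell=1}^{n_{\mathrm{te}}} \alpha_\ell K_\sigma(x_i^{\mathrm{tr}}, x_\ell^{\mathrm{te}}) = 1 \quad \text{and} \quad \alpha_1, \alpha_2, \ldots, \alpha_{n_{\mathrm{te}}} \ge 0.
\]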

This is a convex optimization problem, and the global solution, which tends to be sparse (Boyd & Vandenberghe, 2004), can be obtained, e.g., by simply performing gradient ascent and feasibility satisfaction iteratively. Pseudo code is summarized in Figure 8. The Gaussian width $\sigma$ can be optimized by CV over $\mathrm{KL}'$, where only the test samples $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$ are divided into $k$ disjoint subsets (Sugiyama et al., 2008).

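A MATLAB® implementation of the entire KLIEP algorithm is available from the following web page.

http://sugiyama-www.cs.titech.ac.jp/~sugi/software/KLIEP/

    Input: $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$, $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$, and $\sigma$
    Output: $\widehat{w}(x)$

    $A_{j,\ell} \leftarrow K_\sigma(x_j^{\mathrm{te}}, x_\ell^{\mathrm{te}})$ for $j, \ell = 1, 2, \ldots, n_{\mathrm{te}}$;
    $b_\ell \leftarrow \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} K_\sigma(x_i^{\mathrm{tr}}, x_\ell^{\mathrm{te}})$ for $\ell = 1, 2, \ldots, n_{\mathrm{te}}$;
    Initialize $\alpha$ ($> 0_{n_{\mathrm{te}}}$) and $\varepsilon$ ($0 < \varepsilon \ll 1$);
    Repeat until convergence
        $\alpha \leftarrow \alpha + \varepsilon A^\top (1_{n_{\mathrm{te}}} ./ A\alpha)$;    % Gradient ascent
        $\alpha \leftarrow \alpha + (1 - b^\top \alpha) b / (b^\top b)$;    % Constraint satisfaction
        $\alpha \leftarrow \max(0_{n_{\mathrm{te}}}, \alpha)$;    % Constraint satisfaction
        $\alpha \leftarrow \alpha / (b^\top \alpha)$;    % Constraint satisfaction
    end
    $\widehat{w}(x) \leftarrow \sum_{\ell=1}^{n_{\mathrm{te}}} \alpha_\ell K_\sigma(x, x_\ell^{\mathrm{te}})$;

Figure 8: Pseudo code of KLIEP. $0_{n_{\mathrm{te}}}$ denotes the $n_{\mathrm{te}}$-dimensional vector with all zeros, and $1_{n_{\mathrm{te}}}$ denotes the $n_{\mathrm{te}}$-dimensional vector with all ones. './' indicates the element-wise division, and inequalities and the 'max' operation for vectors are applied in the element-wise manner.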
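The updates of Figure 8 translate almost line by line into code. The following is a minimal Python sketch for one-dimensional inputs, not the MATLAB implementation referenced above: the function name `kliep`, the step size, and the fixed iteration count (used in place of a convergence test) are assumptions made for illustration.

```python
import math
import random


def gaussian_kernel(x, c, sigma):
    """Gaussian kernel of Eq. (5) for scalar inputs."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))


def kliep(x_tr, x_te, sigma, step=0.01, n_iter=200):
    """Return (alpha, w_hat) obtained by the iteration of Figure 8."""
    n_te = len(x_te)
    # A[j][l] = K(x_j^te, x_l^te); b[l] = mean_i K(x_i^tr, x_l^te).
    A = [[gaussian_kernel(xj, c, sigma) for c in x_te] for xj in x_te]
    b = [sum(gaussian_kernel(xi, c, sigma) for xi in x_tr) / len(x_tr)
         for c in x_te]
    bb = sum(v * v for v in b)
    alpha = [1.0] * n_te
    for _ in range(n_iter):
        # Gradient ascent on sum_j log((A alpha)_j): gradient is A'(1 ./ A alpha).
        a_alpha = [sum(a * al for a, al in zip(row, alpha)) for row in A]
        grad = [sum(A[j][l] / a_alpha[j] for j in range(n_te))
                for l in range(n_te)]
        alpha = [al + step * g for al, g in zip(alpha, grad)]
        # Constraint satisfaction: project onto b'alpha = 1, clip, renormalize.
        gap = 1.0 - sum(bv * al for bv, al in zip(b, alpha))
        alpha = [al + gap * bv / bb for al, bv in zip(alpha, b)]
        alpha = [max(0.0, al) for al in alpha]
        scale = sum(bv * al for bv, al in zip(b, alpha))
        alpha = [al / scale for al in alpha]

    def w_hat(x):
        return sum(al * gaussian_kernel(x, c, sigma)
                   for al, c in zip(alpha, x_te))

    return alpha, w_hat


# Small run on samples from the toy densities N(1, (1/2)^2) and N(2, (1/4)^2).
rng = random.Random(0)
x_tr = [rng.gauss(1.0, 0.5) for _ in range(60)]
x_te = [rng.gauss(2.0, 0.25) for _ in range(30)]
alpha, w_hat = kliep(x_tr, x_te, sigma=0.2)
```

By construction of the last normalization step, the average of `w_hat` over the training samples equals one, which is the empirical counterpart of Eq. (7). The sample sizes here are deliberately small because the pure-Python loops scale quadratically in the number of test samples.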

5.3  Numerical  Examples 

Here,  we  illustrate  the  behavior  of  the  KLIEP  method. 

Let us consider the following one-dimensional importance estimation problem:

\[
p_{\mathrm{tr}}(x) = N(x; 1, (1/2)^2) \quad \text{and} \quad p_{\mathrm{te}}(x) = N(x; 2, (1/4)^2).
\]

Let the number of training samples be $n_{\mathrm{tr}} = 200$ and the number of test samples be $n_{\mathrm{te}} = 1000$.

Figure 9 depicts the true importance and its estimates by KLIEP, where three different Gaussian widths $\sigma = 0.02, 0.2, 0.8$ are tested. The graphs show that the performance of KLIEP is highly dependent on the Gaussian width. More specifically, the estimated importance function $\widehat{w}(x)$ fluctuates strongly when $\sigma$ is small, while it is overly smoothed when $\sigma$ is large. When $\sigma$ is chosen appropriately, KLIEP seems to work reasonably well for this example.

Figure 10 depicts the values of the true $\mathrm{KL}'$ (see Eq.(6)) and its estimate by 5-fold CV; the means, the 25th percentiles, and the 75th percentiles over 100 trials are plotted as functions of the Gaussian width $\sigma$. This shows that CV gives a very good estimate of $\mathrm{KL}'$, which results in an appropriate choice of $\sigma$.

6  Conclusions  and  Outlook 

In standard supervised learning theories, test input points are assumed to follow the same probability distribution as training input points. However, this assumption is often violated in real-world learning problems. In this chapter, we reviewed recently proposed techniques for covariate shift adaptation, including importance-weighted empirical risk minimization, importance-weighted cross-validation, and direct importance estimation.

Figure 9: Results of importance estimation by KLIEP. $w(x)$ is the true importance function and $\widehat{w}(x)$ is its estimate obtained by KLIEP. Panels: (a) Gaussian width $\sigma = 0.02$, (b) Gaussian width $\sigma = 0.2$, (c) Gaussian width $\sigma = 0.8$.

In Section 5, we introduced the KLIEP algorithm for importance estimation, where linearly-parameterized models were used. It was shown that the KLIEP idea can also be naturally applied to log-linear models (Tsuboi et al., 2009), Gaussian mixture models (Yamada & Sugiyama, 2009), and probabilistic principal component analysis mixture models (Yamada et al., 2010b). Other than KLIEP, various methods of direct importance estimation have also been proposed (Silverman, 1978; Cwik & Mielniczuk, 1989; Qin, 1998; Cheng & Chu, 2004; Huang et al., 2007; Bickel et al., 2007; Kanamori et al., 2009a). Among them, the method proposed in Kanamori et al. (2009a) called unconstrained least-squares importance fitting (uLSIF) gives an analytic-form solution and the solution can be computed very efficiently in a stable manner. Thus it can be applied to large-scale data sets.

Figure 10: Model selection curve for KLIEP. $\mathrm{KL}'$ is the true score of an estimated importance (see Eq.(6)) and $\widehat{\mathrm{KL}}'_{\mathrm{CV}}$ is its estimate by 5-fold CV. The horizontal axis is the Gaussian width $\sigma$.

Recently, importance estimation methods which incorporate dimensionality reduction have been developed. A method proposed by Sugiyama et al. (2010a) uses a supervised dimensionality reduction technique called local Fisher discriminant analysis (Sugiyama, 2007) for identifying a subspace in which two densities are significantly different (which is called the hetero-distributional subspace). Another method proposed by Sugiyama et al. (2011) tries to find the hetero-distributional subspace by directly minimizing the discrepancy between the two distributions. Theoretical analysis of importance estimation has also been conducted thoroughly (Silverman, 1978; Cwik & Mielniczuk, 1989; Gijbels & Mielniczuk, 1995; Jacob & Oliveira, 1997; Qin, 1998; Cheng & Chu, 2004; Bensaid & Fabre, 2007; Nguyen et al., 2010; Sugiyama et al., 2008; Chen et al., 2009; Kanamori et al., 2009b; Kanamori et al., 2010).

It has been shown that various statistical data processing tasks can be solved through importance estimation (Sugiyama et al., 2009; Sugiyama et al., 2012), including multi-task learning (Bickel et al., 2007), inlier-based outlier detection (Silverman, 1978; Hido et al., 2008; Smola et al., 2009; Hido et al., 2011), change detection in time series (Kawahara & Sugiyama, 2011), mutual information estimation (Suzuki et al., 2008; Suzuki et al., 2009b), independent component analysis (Suzuki & Sugiyama, 2011), feature selection (Suzuki et al., 2009a), sufficient dimension reduction (Suzuki & Sugiyama, 2010), causal inference (Yamada & Sugiyama, 2010), conditional density estimation (Sugiyama et al., 2010b), and probabilistic classification (Sugiyama, 2010). Thus, following this line of research, further improving the accuracy and computational efficiency of importance estimation as well as further exploring possible applications of importance estimation would be a promising direction to be pursued.

Abstract

The goal of cross-domain object matching (CDOM) is to find correspondence between two sets of objects in different domains in an unsupervised way. Photo album summarization is a typical application of CDOM, where photos are automatically aligned into a designed frame expressed in the Cartesian coordinate system. CDOM is usually formulated as finding a mapping from objects in one domain (photos) to objects in the other domain (frame) so that the pairwise dependency is maximized. A state-of-the-art CDOM method employs a kernel-based dependency measure, but it has a drawback that the kernel parameter needs to be determined manually. In this paper, we propose alternative CDOM methods that can naturally address the model selection problem. Through experiments on image matching, unpaired voice conversion, and photo album summarization tasks, the effectiveness of the proposed methods is demonstrated.


1  Introduction 

The objective of cross-domain object matching (CDOM) is to match two sets of objects in different domains. For instance, in photo album summarization, photos are automatically assigned into a designed frame expressed in the Cartesian coordinate system. A typical approach of CDOM is to find a mapping from objects in one domain (photos) to objects in the other domain (frame) so that the pairwise dependency is maximized. In this scenario, accurately evaluating the dependence between objects is a key challenge.

Kernelized sorting (KS) (Jebara, 2004) tries to find a mapping between two domains that maximizes the mutual information (MI) (Cover and Thomas, 2006) under the Gaussian assumption. However, since the Gaussian assumption may not be fulfilled in practice, this method (which we refer to as KS-MI) tends to perform poorly.

To overcome the limitation of KS-MI, Quadrianto et al. (2010) proposed using the kernel-based dependence measure called the Hilbert-Schmidt independence criterion (HSIC) (Gretton et al., 2005) for KS. Since HSIC is distribution-free, KS with HSIC (which we refer to as KS-HSIC) is more flexible than KS-MI. However, HSIC includes a tuning parameter (more specifically, the Gaussian kernel width), and its choice is crucial to obtain better performance (see also Jagarlamudi et al., 2010). Although using the median distance between sample points as the Gaussian kernel width is a common heuristic in kernel-based dependence measures (see e.g., Fukumizu et al., 2009a), this does not always perform well in practice.

In this paper, we propose two alternative CDOM
methods that can naturally address the model selection
problem. The first method employs another
kernel-based dependence measure based on the normalized
cross-covariance operator (NOCCO) (Fukumizu
et al., 2009b), which we refer to as KS-NOCCO.
The  NOCCO-based  dependence  measure  was  shown  to 
be  asymptotically  independent  of  the  choice  of  kernels. 
Thus,  KS-NOCCO  is  expected  to  be  less  sensitive  to 
the  kernel  parameter  choice,  which  is  an  advantage 
over  HSIC. 

The  second  method  uses  least-squares  mutual  infor¬ 
mation  (LSMI)  (Suzuki  et  al.,  2009)  as  the  depen¬ 
dence  measure,  which  is  a  consistent  estimator  of  the 
squared-loss  mutual  information  (SMI)  achieving  the 
optimal  convergence  rate.  We  call  this  method  least- 
squares  object  matching  (LSOM).  An  advantage  of 
LSOM  is  that  cross-validation  (CV)  with  respect  to  the 
LSMI criterion is possible. Thus, all the tuning parameters,
such as the Gaussian kernel width and the regularization
parameter, can be objectively determined by CV.

Cross-Domain Object Matching with Model Selection

Through  experiments  on  image  matching,  unpaired 
voice  conversion,  and  photo  album  summarization 
tasks,  LSOM  is  shown  to  be  the  most  promising  ap¬ 
proach  to  CDOM. 

2  Problem  Formulation 

In  this  section,  we  formulate  the  problem  of  cross¬ 
domain  object  matching  (CDOM). 

The goal of CDOM is, given two sets of samples of the
same size, $\{x_i\}_{i=1}^n$ and $\{y_i\}_{i=1}^n$, to find a mapping that
well "matches" them.

Let $\pi$ be a permutation function over $\{1, \ldots, n\}$, and
let $\Pi$ be the corresponding permutation indicator matrix,
i.e.,

$$\Pi \in \{0, 1\}^{n \times n}, \quad \Pi 1_n = 1_n, \quad \text{and} \quad \Pi^\top 1_n = 1_n,$$

where $1_n$ is the n-dimensional vector with all ones and
$^\top$ denotes the transpose. Let us denote the samples
matched by a permutation $\pi$ by

$$Z(\Pi) := \{(x_i, y_{\pi(i)})\}_{i=1}^n.$$

The optimal permutation, denoted by $\Pi^*$, can be obtained
as the maximizer of the dependency between
the two sets $\{x_i\}_{i=1}^n$ and $\{y_i\}_{i=1}^n$:

$$\Pi^* := \operatorname*{argmax}_{\Pi} D(Z(\Pi)),$$

where D is some dependence measure.
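
As an illustration of the notation above, the following minimal sketch builds a permutation indicator matrix and the matched sample set. The helper names `permutation_matrix` and `matched_pairs`, the 0-based indexing, and the placement convention `Pi[pi[i], i] = 1` are assumptions made here for illustration; the convention is chosen so that a permuted kernel matrix takes the form $\Pi^\top L \Pi$.

```python
import numpy as np

def permutation_matrix(pi):
    """Indicator matrix Pi with Pi[pi[i], i] = 1 (assumed indexing convention)."""
    n = len(pi)
    Pi = np.zeros((n, n), dtype=int)
    Pi[pi, np.arange(n)] = 1
    return Pi

def matched_pairs(x, y, pi):
    """Z(Pi): pair the i-th x with the pi(i)-th y."""
    return [(x[i], y[pi[i]]) for i in range(len(pi))]

pi = np.array([2, 0, 1])              # pi(0)=2, pi(1)=0, pi(2)=1 (0-based)
Pi = permutation_matrix(pi)
ones = np.ones(3, dtype=int)
pairs = matched_pairs(["a", "b", "c"], ["p", "q", "r"], pi)

# With this convention, Pi^T L Pi reorders rows and columns of L by pi.
L_demo = np.arange(9).reshape(3, 3)
L_perm = Pi.T @ L_demo @ Pi
```

Under this convention, `L_perm[i, j]` equals `L_demo[pi[i], pi[j]]`, which is the entry a kernel matrix on the permuted samples would contain.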

3  Existing  Methods 

In  this  section,  we  review  two  existing  methods  for 
CDOM,  and  point  out  their  potential  weaknesses. 

3.1 Kernelized Sorting with Mutual
Information

Kernelized sorting with mutual information (KS-MI)
(Jebara, 2004) matches objects in different domains so
that MI between matched pairs is maximized. Here,
we review KS-MI following the alternative derivation
provided in Quadrianto et al. (2010).

MI is one of the popular dependence measures between
random variables. For random variables X and Y, MI
is defined as follows (Cover and Thomas, 2006):

$$\mathrm{MI}(Z) := \iint p(x, y) \log \frac{p(x, y)}{p(x) p(y)} \, dx \, dy,$$

where p(x, y) denotes the joint density of x and y,
and p(x) and p(y) are marginal densities of x and y,
respectively. MI is zero if and only if x and y are
independent, and thus it may be used as a dependency
measure. Let H(X), H(Y), and H(X, Y) be the entropies
of X and Y and the joint entropy of X and Y,
respectively:

$$H(X) = -\int p(x) \log p(x) \, dx,$$

$$H(Y) = -\int p(y) \log p(y) \, dy,$$

$$H(X, Y) = -\iint p(x, y) \log p(x, y) \, dx \, dy.$$

Then MI between X and Y can be written as

$$\mathrm{MI}(Z) = H(X) + H(Y) - H(X, Y).$$

Since H(X) and H(Y) are independent of permutation
$\Pi$, maximizing MI is equivalent to minimizing
the joint entropy H(X, Y). If p(x, y) is Gaussian with
covariance matrix $\Sigma$, the joint entropy is expressed as

$$H(X, Y) = \frac{1}{2} \log |\Sigma| + \mathrm{Const.},$$

where $|\Sigma|$ denotes the determinant of matrix $\Sigma$.

Now, let us assume that x and y are jointly normal
in some reproducing kernel Hilbert spaces (RKHSs)
endowed with joint kernel K(x, x')L(y, y'), where
K(x, x') and L(y, y') are reproducing kernels for x
and y, respectively. Then KS-MI is formulated as follows:

$$\min_{\Pi} \log \left| \Gamma \left( K \circ (\Pi^\top L \Pi) \right) \Gamma \right|, \qquad (1)$$

where $K = \{K(x_i, x_j)\}_{i,j=1}^n$ and $L = \{L(y_i, y_j)\}_{i,j=1}^n$
are kernel matrices, $\circ$ denotes the Hadamard product
(a.k.a. the element-wise product), $\Gamma = I_n - \frac{1}{n} 1_n 1_n^\top$
is the centering matrix, and $I_n$ is the n-dimensional
identity matrix.

A critical weakness of KS-MI is the Gaussian assumption,
which may not be fulfilled in practice.

3.2  Kernelized  Sorting  with  Hilbert-Schmidt 
Independence  Criterion 

Kernelized  sorting  with  Hilbert-Schmidt  independence 
criterion  (KS-HSIC)  matches  objects  in  different  do¬ 
mains  so  that  HSIC  between  matched  pairs  is  maxi¬ 
mized. 

HSIC is a kernel-based dependence measure given as
follows (Gretton et al., 2005):

$$\mathrm{HSIC}(Z) = \mathrm{tr}(\bar{K} \bar{L}),$$

where $\bar{K} = \Gamma K \Gamma$ and $\bar{L} = \Gamma L \Gamma$ are the centered
kernel matrices for x and y, respectively. Note that
smaller HSIC scores mean that X and Y are closer to
being independent.

KS-HSIC is formulated as follows (Quadrianto et al.,
2010):

$$\max_{\Pi} \mathrm{HSIC}(Z(\Pi)), \qquad (2)$$

where

$$\mathrm{HSIC}(Z(\Pi)) = \mathrm{tr}(\bar{K} \Pi^\top \bar{L} \Pi). \qquad (3)$$
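
The HSIC score of Eq.(3) can be evaluated directly from the two kernel matrices. The sketch below is only an illustration: the function names, the random toy data, and the Gaussian kernel width of 1.0 are assumptions, and the permutation matrix uses the convention that the i-th column selects the pi(i)-th row.

```python
import numpy as np

def gaussian_kernel(a, sigma):
    """Gaussian kernel matrix for the rows of a (shape: n x d)."""
    sq = np.sum((a[:, None, :] - a[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def centered(K):
    """Gamma K Gamma with Gamma = I - (1/n) 1 1^T."""
    n = K.shape[0]
    G = np.eye(n) - np.ones((n, n)) / n
    return G @ K @ G

def hsic(K, L, Pi=None):
    """tr(Kc Pi^T Lc Pi); Pi defaults to the identity (no permutation)."""
    n = K.shape[0]
    if Pi is None:
        Pi = np.eye(n)
    return float(np.trace(centered(K) @ Pi.T @ centered(L) @ Pi))

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 2))
y = rng.normal(size=(6, 1))
K, L = gaussian_kernel(x, 1.0), gaussian_kernel(y, 1.0)

p = rng.permutation(6)
Pi = np.zeros((6, 6))
Pi[p, np.arange(6)] = 1.0

score_via_Pi = hsic(K, L, Pi)
score_via_reindex = hsic(K, L[np.ix_(p, p)])   # same value, computed by re-indexing L
```

Because the centering matrix commutes with any permutation, permuting the samples before or after centering gives the same score, which is why the two values agree.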

This optimization problem is a quadratic assignment
problem (QAP) (Finke et al., 1987), which
is known to be NP-hard. There exist several
QAP solvers based on, e.g., simulated annealing, tabu
search, and genetic algorithms. However, those QAP
solvers are not easy to use in practice since they contain
various tuning parameters.

Another  approach  to  solving  Eq.(2)  based  on  a  lin¬ 
ear  assignment  problem  (LAP)  (Kuhn,  1955)  was  pro¬ 
posed  in  Quadrianto  et  al.  (2010),  which  is  explained 
below. Let us relax the permutation indicator matrix
$\Pi$ to take real values:

$$\Pi \in [0, 1]^{n \times n}, \quad \Pi 1_n = 1_n, \quad \text{and} \quad \Pi^\top 1_n = 1_n. \qquad (4)$$

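Then, Eq.(3) is convex with respect to $\Pi$ (see Lemma
7 in Quadrianto et al., 2010), and its lower bound can
be obtained using some $\widetilde{\Pi}$ as follows:

$$\mathrm{tr}(\bar{K} \Pi^\top \bar{L} \Pi) \ge \mathrm{tr}(\bar{K} \widetilde{\Pi}^\top \bar{L} \widetilde{\Pi}) + \left\langle \Pi - \widetilde{\Pi}, \frac{\partial \mathrm{HSIC}(Z(\widetilde{\Pi}))}{\partial \widetilde{\Pi}} \right\rangle$$

$$= 2 \, \mathrm{tr}(\bar{K} \Pi^\top \bar{L} \widetilde{\Pi}) - \mathrm{tr}(\bar{K} \widetilde{\Pi}^\top \bar{L} \widetilde{\Pi}),$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product between matrices.
Based on the above lower bound, Quadrianto
et al. (2010) proposed to update the permutation matrix
as

$$\Pi^{\mathrm{new}} = (1 - \eta) \Pi^{\mathrm{old}} + \eta \operatorname*{argmax}_{\Pi} \mathrm{tr}(\Pi^\top \bar{L} \Pi^{\mathrm{old}} \bar{K}), \qquad (5)$$

where $0 < \eta \le 1$ is a step size. The second term is an
LAP subproblem, which can be efficiently solved by
using the Hungarian method (Kuhn, 1955).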
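
The iteration of Eq.(5) can be sketched as follows. This is a hedged illustration rather than the implementation used in the experiments: it swaps in SciPy's `linear_sum_assignment` as the LAP solver, and the function names, the toy data, the kernel width, and the stopping rule are assumptions introduced here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def gaussian_kernel(a, sigma):
    sq = np.sum((a[:, None, :] - a[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def centered(K):
    n = K.shape[0]
    G = np.eye(n) - np.ones((n, n)) / n
    return G @ K @ G

def lap_argmax(M):
    """Permutation matrix Pi maximizing tr(Pi^T M) = sum_ij Pi_ij M_ij."""
    rows, cols = linear_sum_assignment(M, maximize=True)
    Pi = np.zeros_like(M)
    Pi[rows, cols] = 1.0
    return Pi

def objective(Kc, Lc, Pi):
    return float(np.trace(Kc @ Pi.T @ Lc @ Pi))

def ks_hsic(K, L, Pi0, eta=1.0, max_iter=20):
    """Repeat the update of Eq.(5) until Pi stops changing (assumed stopping rule)."""
    Kc, Lc = centered(K), centered(L)
    Pi = Pi0.astype(float)
    for _ in range(max_iter):
        Pi_new = (1.0 - eta) * Pi + eta * lap_argmax(Lc @ Pi @ Kc)
        if np.allclose(Pi_new, Pi):
            break
        Pi = Pi_new
    return Pi

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 2))
y = x[rng.permutation(8)] + 0.05 * rng.normal(size=(8, 2))   # shuffled, noisy copy
K, L = gaussian_kernel(x, 1.0), gaussian_kernel(y, 1.0)

Pi0 = np.eye(8)
Pi_hat = ks_hsic(K, L, Pi0)
before = objective(centered(K), centered(L), Pi0)
after = objective(centered(K), centered(L), Pi_hat)
```

With the step size set to 1, every iterate is itself a permutation matrix, and the convexity of Eq.(3) implies that the objective does not decrease from one iterate to the next. The result is still only a local solution, so the initial matrix matters.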

In the original KS-HSIC paper (Quadrianto et al.,
2010), a C++ implementation of the Hungarian
method provided by Cooper¹ was used for solving
Eq.(5); then $\Pi$ is kept updated by Eq.(5) until convergence.

¹ http://mit.edu/harold/www/code.html

In this iterative optimization procedure, the choice of
initial permutation matrices is critical to obtain a good
solution. Quadrianto et al. (2010) proposed the following
initialization scheme. Suppose the kernel matrices
K and L are rank one, i.e., for some f and g, K and
L can be expressed as $K = f f^\top$ and $L = g g^\top$. Then
HSIC can be written as

$$\mathrm{HSIC}(Z(\Pi)) = \| f^\top \Pi g \|^2. \qquad (6)$$

The initial permutation matrix is determined so that
Eq.(6) is maximized. According to Theorems 368 and
369 in Hardy et al. (1952), the maximum of Eq.(6) is
attained when the elements of f and $\Pi g$ are ordered in
the same way. That is, if the elements of f are ordered
in the ascending manner (i.e., $f_1 \le f_2 \le \cdots \le f_n$),
the maximum of Eq.(6) is attained by ordering the
elements of g in the same ascending way. However,
since the kernel matrices K and L may not be rank
one in practice, the principal eigenvectors of K and
L were used as f and g in the original KS-HSIC paper
(Quadrianto et al., 2010). We call this eigenvalue-based
initialization.
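
A minimal sketch of this sorting-based initialization is given below. Two points are assumptions of the sketch rather than statements from the text: the permutation matrix is oriented to match the trace objective $\mathrm{tr}(K \Pi^\top L \Pi)$, and, because an eigenvector is only defined up to sign, both relative sign choices are tried and the better one is kept. The function name `eig_init` and the toy vectors are hypothetical.

```python
import numpy as np

def eig_init(K, L):
    """Initial permutation from the principal eigenvectors of K and L."""
    f = np.linalg.eigh(K)[1][:, -1]   # eigh sorts eigenvalues in ascending order
    g = np.linalg.eigh(L)[1][:, -1]
    n = len(f)
    best, best_val = None, -np.inf
    for sign in (1.0, -1.0):          # eigenvector sign ambiguity: try both
        Pi = np.zeros((n, n))
        # Pair the r-th smallest element of g with the r-th smallest of sign*f.
        Pi[np.argsort(g), np.argsort(sign * f)] = 1.0
        val = np.trace(K @ Pi.T @ L @ Pi)
        if val > best_val:
            best, best_val = Pi, val
    return best

# Rank-one toy example: g is a shuffled copy of f.
f_true = np.array([3.0, 1.0, 2.0, 5.0])
g_true = f_true[np.array([2, 0, 3, 1])]
K = np.outer(f_true, f_true)
L = np.outer(g_true, g_true)

Pi_init = eig_init(K, L)
value = np.trace(K @ Pi_init.T @ L @ Pi_init)
```

In this rank-one example the sorted pairing aligns every element of f with its copy in g, so the objective reaches its largest possible value, the squared sum of squares of f.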

Since  HSIC  is  a  distribution-free  dependence  measure, 
KS-HSIC  is  more  flexible  than  KS-MI.  However,  a  crit¬ 
ical  weakness  of  HSIC  is  that  its  performance  is  sensi¬ 
tive  to  the  choice  of  kernels  (Jagarlamudi  et  al.,  2010). 
A  practical  heuristic  is  to  use  the  Gaussian  kernel  with 
width  set  to  the  median  distance  between  samples  (see 
e.g., Fukumizu et al., 2009a), but this does not always
work  well  in  practice. 

4  Proposed  Methods 

In  this  section,  we  propose  two  alternative  CDOM 
methods  that  can  naturally  address  the  model  selec¬ 
tion  problem. 

4.1  Kernelized  Sorting  with  Normalized 
Cross-Covariance  Operator 

The kernel-based dependence measure based on
the normalized cross-covariance operator (NOCCO)
is given as follows (Fukumizu et al., 2009b):

$$D_{\mathrm{NOCCO}}(Z) = \mathrm{tr}(\widetilde{K} \widetilde{L}),$$

where $\widetilde{K} = \bar{K}(\bar{K} + n \epsilon I_n)^{-1}$, $\widetilde{L} = \bar{L}(\bar{L} + n \epsilon I_n)^{-1}$,
and $\epsilon > 0$ is a regularization parameter. $D_{\mathrm{NOCCO}}$ was
shown to be asymptotically independent of the choice
of kernels. Thus, KS with $D_{\mathrm{NOCCO}}$ (KS-NOCCO) is
expected to be less sensitive to the kernel parameter
choice than KS-HSIC.

The permuted version of $\widetilde{L}$ can be written as

$$\widetilde{L}(\Pi) = \Pi^\top \bar{L} \Pi (\Pi^\top \bar{L} \Pi + n \epsilon I_n)^{-1}$$

$$= \Pi^\top \bar{L} (\bar{L} + n \epsilon I_n)^{-1} \Pi$$

$$= \Pi^\top \widetilde{L} \Pi,$$

where we used the orthogonality of $\Pi$ (i.e., $\Pi^\top \Pi =
\Pi \Pi^\top = I_n$). Thus, the dependency measure for $Z(\Pi)$
can be written as

$$D_{\mathrm{NOCCO}}(Z(\Pi)) = \mathrm{tr}(\widetilde{K} \Pi^\top \widetilde{L} \Pi).$$

Since this is essentially the same form as HSIC, a local
optimal solution may be obtained in the same way as
KS-HSIC:

$$\Pi^{\mathrm{new}} = (1 - \eta) \Pi^{\mathrm{old}} + \eta \operatorname*{argmax}_{\Pi} \mathrm{tr}(\Pi^\top \widetilde{L} \Pi^{\mathrm{old}} \widetilde{K}). \qquad (7)$$

However, the property that $D_{\mathrm{NOCCO}}$ is independent
of the kernel choice holds only asymptotically. Thus,
with finite samples, $D_{\mathrm{NOCCO}}$ does still depend on the
choice of kernels as well as the regularization parameter
$\epsilon$, which needs to be manually tuned.
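
The NOCCO-based score and its behavior under permutation can be checked numerically with the sketch below. It assumes that the regularized factors are built from the centered kernel matrices; the function names, the toy data, the kernel width, and the value of the regularization parameter are illustrative choices only.

```python
import numpy as np

def gaussian_kernel(a, sigma):
    sq = np.sum((a[:, None, :] - a[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def centered(K):
    n = K.shape[0]
    G = np.eye(n) - np.ones((n, n)) / n
    return G @ K @ G

def nocco_factor(K, eps):
    """Kc (Kc + n*eps*I)^{-1}; the two factors commute, so solve() suffices."""
    n = K.shape[0]
    Kc = centered(K)
    return np.linalg.solve(Kc + n * eps * np.eye(n), Kc)

def nocco(K, L, eps, Pi=None):
    """tr(K~ Pi^T L~ Pi); Pi defaults to the identity."""
    n = K.shape[0]
    if Pi is None:
        Pi = np.eye(n)
    return float(np.trace(nocco_factor(K, eps) @ Pi.T @ nocco_factor(L, eps) @ Pi))

rng = np.random.default_rng(1)
x = rng.normal(size=(7, 2))
y = rng.normal(size=(7, 2))
K, L = gaussian_kernel(x, 1.0), gaussian_kernel(y, 1.0)

p = rng.permutation(7)
Pi = np.zeros((7, 7))
Pi[p, np.arange(7)] = 1.0

d_via_Pi = nocco(K, L, 0.05, Pi)
d_via_reindex = nocco(K, L[np.ix_(p, p)], 0.05)   # permute the samples first
```

The two values agree, which mirrors the identity stating that the permuted regularized factor equals the permutation applied to the original factor.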


4.2  Least-Squares  Object  Matching 


Next, we propose an alternative method called least-squares
object matching (LSOM), in which we employ
least-squares mutual information (LSMI) (Suzuki
et al., 2009) as a dependency measure. LSMI is a consistent
estimator of the squared-loss mutual information
(SMI) with the optimal convergence rate. SMI is
defined and expressed as

$$\mathrm{SMI}(Z) := \frac{1}{2} \iint \left( \frac{p(x, y)}{p(x) p(y)} - 1 \right)^2 p(x) p(y) \, dx \, dy$$

$$= \frac{1}{2} \iint \frac{p(x, y)}{p(x) p(y)} \, p(x, y) \, dx \, dy - \frac{1}{2}. \qquad (8)$$

Note that SMI is the Pearson divergence (Pearson,
1900) from p(x, y) to p(x)p(y), while the ordinary
MI is the Kullback-Leibler divergence (Kullback and
Leibler, 1951) from p(x, y) to p(x)p(y). SMI is zero
if and only if x and y are independent, as is the ordinary
MI. Its estimator LSMI is given as follows (Suzuki
et al., 2009) (see Appendix for the derivation of LSMI):

$$\mathrm{LSMI}(Z) = \frac{1}{2} \widehat{\alpha}^\top \widehat{h} - \frac{1}{2},$$

where

$$\widehat{\alpha} = (\widehat{H} + \lambda I_n)^{-1} \widehat{h},$$

$$\widehat{H} = \frac{1}{n^2} (K K^\top) \circ (L L^\top),$$

$$\widehat{h} = \frac{1}{n} (K \circ L) 1_n.$$

Here, $\lambda$ (> 0) is the regularization parameter. Since
cross-validation (CV) with respect to SMI is possible
for model selection, tuning parameters in LSMI
(i.e., the kernel parameters and the regularization
parameter) can be objectively optimized. This is a notable
advantage over kernel-based approaches such as
KS-HSIC and KS-NOCCO, since the choice of kernels
heavily affects the sensitivity of kernel-based
independence measures
(Fukumizu et al., 2009a).
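
The closed-form LSMI estimator, one half of the inner product between the coefficient vector and h, minus one half, is short enough to sketch directly. It is assumed here that the estimator takes that standard form with the coefficients obtained from a regularized linear system; the function names, the toy data, the kernel widths, and the value of the regularization parameter are illustrative.

```python
import numpy as np

def gaussian_kernel(a, sigma):
    sq = np.sum((a[:, None, :] - a[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def lsmi(K, L, lam):
    """Return (LSMI value, alpha) for kernel matrices K, L and regularizer lam."""
    n = K.shape[0]
    H = (K @ K.T) * (L @ L.T) / n ** 2        # Hadamard product of the two Gram products
    h = (K * L) @ np.ones(n) / n              # row sums of the element-wise product
    alpha = np.linalg.solve(H + lam * np.eye(n), h)
    return 0.5 * float(alpha @ h) - 0.5, alpha

rng = np.random.default_rng(0)
n = 30
x = rng.normal(size=(n, 1))
y = x + 0.1 * rng.normal(size=(n, 1))          # y depends strongly on x
K, L = gaussian_kernel(x, 0.5), gaussian_kernel(y, 0.5)

lam = 1e-2
value, alpha = lsmi(K, L, lam)
```

Solving the linear system with `np.linalg.solve` avoids forming an explicit inverse, which is usually the more stable choice for this kind of regularized system.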

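Below, we use the following equivalent expression of
LSMI:

$$\mathrm{LSMI}(Z) = \frac{1}{2n} \mathrm{tr}(L \widehat{A} K) - \frac{1}{2}, \qquad (9)$$

where $\widehat{A}$ is the diagonal matrix with diagonal elements
given by $\widehat{\alpha}$. Note that we used Eq.(73) and Eq.(75) in
Minka (2000) for obtaining the above expression.

LSMI for the permuted data $Z(\Pi)$ is given by

$$\mathrm{LSMI}(Z(\Pi)) = \frac{1}{2n} \mathrm{tr}(\Pi^\top L \Pi \widehat{A}_\Pi K) - \frac{1}{2},$$

where $\widehat{A}_\Pi$ is the diagonal matrix with diagonal elements
given by $\widehat{\alpha}_\Pi$, and $\widehat{\alpha}_\Pi$ is given by

$$\widehat{\alpha}_\Pi = (\widehat{H}_\Pi + \lambda I_n)^{-1} \widehat{h}_\Pi,$$

$$\widehat{H}_\Pi = \frac{1}{n^2} (K K^\top) \circ (\Pi^\top L L^\top \Pi),$$

$$\widehat{h}_\Pi = \frac{1}{n} (K \circ (\Pi^\top L \Pi)) 1_n.$$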

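Consequently, LSOM is formulated as follows:

$$\max_{\Pi} \mathrm{LSMI}(Z(\Pi)).$$

Since this optimization problem is in general NP-hard
and is not convex, we simply use the same optimization
strategy as KS-HSIC, i.e., for the current $\Pi^{\mathrm{old}}$, the
solution is updated as

$$\Pi^{\mathrm{new}} = (1 - \eta) \Pi^{\mathrm{old}} + \eta \operatorname*{argmax}_{\Pi} \mathrm{tr}(\Pi^\top L \Pi^{\mathrm{old}} \widehat{A}_{\Pi^{\mathrm{old}}} K). \qquad (10)$$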
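
The quantities entering the update of Eq.(10) can be assembled as in the following sketch, which computes LSMI for permuted data through the trace form and builds the matrix whose linear assignment problem is then solved. The function names, the toy data, the kernel widths, and the regularization value are assumptions; the LAP solver itself is left out, and any solver that maximizes the trace of the transposed permutation matrix times this cost matrix could be plugged in.

```python
import numpy as np

def gaussian_kernel(a, sigma):
    sq = np.sum((a[:, None, :] - a[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def lsmi_permuted(K, L, Pi, lam):
    """LSMI(Z(Pi)) via the trace form, together with the coefficient vector."""
    n = K.shape[0]
    Lp = Pi.T @ L @ Pi
    H = (K @ K.T) * (Pi.T @ L @ L.T @ Pi) / n ** 2
    h = (K * Lp) @ np.ones(n) / n
    alpha = np.linalg.solve(H + lam * np.eye(n), h)
    value = np.trace(Lp @ np.diag(alpha) @ K) / (2 * n) - 0.5
    return float(value), alpha, h

def lap_cost(K, L, Pi_old, lam):
    """Matrix M such that the LAP step maximizes tr(Pi^T M) over permutations."""
    _, alpha, _ = lsmi_permuted(K, L, Pi_old, lam)
    return L @ Pi_old @ np.diag(alpha) @ K

rng = np.random.default_rng(0)
n = 10
x = rng.normal(size=(n, 1))
y = x[rng.permutation(n)]
K, L = gaussian_kernel(x, 0.7), gaussian_kernel(y, 0.7)

p = rng.permutation(n)
Pi_old = np.zeros((n, n))
Pi_old[p, np.arange(n)] = 1.0

value, alpha, h = lsmi_permuted(K, L, Pi_old, 1e-2)
M = lap_cost(K, L, Pi_old, 1e-2)
```

Evaluated at the current permutation, the linear objective recovers the trace inside the permuted LSMI expression, which gives a quick consistency check between the two formulas.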


5  Experiments 


In  this  section,  we  experimentally  evaluate  our  pro¬ 
posed  algorithms  in  the  image  matching,  unpaired 
voice  conversion,  and  photo  album  summarization 
tasks. 


In all the methods, we use the Gaussian kernels:

$$K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2 \sigma_x^2} \right), \qquad L(y, y') = \exp\left( -\frac{\|y - y'\|^2}{2 \sigma_y^2} \right),$$

and we set the maximum number of iterations for updating
permutation matrices to 20 and the step size $\eta$
to 1. To avoid falling into undesirable local optima,
optimization is carried out 10 times with different initial
permutation matrices, which are determined by the
eigenvalue-based initialization heuristic with Gaussian
kernel widths

$$(\sigma_x, \sigma_y) = c \times (m_x, m_y),$$

where $c = 1^{1/2}, 2^{1/2}, \ldots, 10^{1/2}$, and

$$m_x = 2^{-1/2} \, \mathrm{median}(\{\|x_i - x_j\|\}_{i,j=1}^n),$$

$$m_y = 2^{-1/2} \, \mathrm{median}(\{\|y_i - y_j\|\}_{i,j=1}^n).$$

Figure 1: Image matching results. (a) KS-HSIC with different Gaussian
kernel widths. (b) KS-NOCCO with different Gaussian kernel widths and
regularization parameters. (c) LSOM (tuned by CV), optimally-tuned
KS-NOCCO, and optimally-tuned KS-HSIC. The best method in terms of the
mean error and comparable methods according to the t-test at the
significance level 1% are specified by 'o'.

In KS-HSIC and KS-NOCCO, we use the Gaussian
kernel with the following widths:

$$(\sigma_x, \sigma_y) = d \times (m_x, m_y),$$

where $d = 1^{1/2}, 10^{1/2}$. In KS-NOCCO, we use the
following regularization parameters:

$$\epsilon = 0.01, 0.05.$$

In LSOM, we choose the model parameters of LSMI,
$\sigma_x$, $\sigma_y$, and $\lambda$, by 2-fold CV from

$$(\sigma_x, \sigma_y) = c \times (m_x, m_y),$$

$$\lambda = 10^{-1}, 10^{-2}, 10^{-3}.$$
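
The candidate kernel widths can be generated as in the sketch below, which scales the median pairwise distance by the square roots of 1 to 10. Two points are assumptions of this sketch: the median is taken over all index pairs including identical ones, which is a literal reading of the index range and not something the text spells out, and the function names and toy data are hypothetical.

```python
import numpy as np

def median_scale(a):
    """2^{-1/2} times the median of all pairwise distances (diagonal included)."""
    diff = a[:, None, :] - a[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=-1))
    return float(np.median(dist) / np.sqrt(2.0))

def width_candidates(a, n_candidates=10):
    """Widths c * m for c = sqrt(1), sqrt(2), ..., sqrt(n_candidates)."""
    m = median_scale(a)
    return [float(np.sqrt(c) * m) for c in range(1, n_candidates + 1)]

x = np.array([[0.0], [1.0], [3.0]])
m_x = median_scale(x)              # pairwise distances: 0,0,0,1,1,2,2,3,3 -> median 1
sigma_x_grid = width_candidates(x)
```

For the three toy points the median pairwise distance is 1, so the base scale is the reciprocal of the square root of 2, and the fourth candidate (c equal to the square root of 4) is exactly twice that value.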

5.1  Image  Matching 

Let us consider a toy image matching problem. In
this experiment, we use images with RGB format used
in Quadrianto et al. (2010), which were originally extracted
from Flickr². We first convert the images from
RGB to Lab space and resize them to 40 x 40 pixels.
Next, we convert an image into a 4800-dimensional
vector (4800 = 40 x 40 x 3). Then, we vertically divide
images of size 40 x 40 pixels in the middle, and make
two sets of half-images $\{x_i\}_{i=1}^n$ and $\{y_i\}_{i=1}^n$. Given
that $\{y_i\}_{i=1}^n$ is randomly permuted, the goal is to recover
the correct correspondence.

² http://www.flickr.com

Figure 1 summarizes the average correct matching
rate over 100 runs as functions of the number of images,
showing that the proposed LSOM method tends
to outperform the best-tuned KS-NOCCO and KS-HSIC
methods. Note that the tuning parameters
of LSOM ($\sigma_x$, $\sigma_y$, and $\lambda$) are automatically tuned by
CV. Figure 2 depicts an example of image matching
results obtained by LSOM, showing that most of the
images are correctly matched.

Figure 2: Image matching result by LSOM. In this
case, 234 out of 320 images (73.1%) are matched correctly.

Figure 3: True spectral envelopes and their estimates.


5.2 Unpaired Voice Conversion

Next, we consider an unpaired voice conversion task,
which is aimed at matching the voice of a source
speaker with that of a target speaker.

In this experiment, we use 200 short utterance samples
recorded from two male speakers in French, with
sampling rate 44.1kHz. We first convert the utterance
samples to 50-dimensional line spectral frequencies
(LSF) vectors (Kain and Macon, 1988). We denote
the source and target LSF vectors by x and y, respectively.
Then the voice conversion task can be regarded
as a multi-dimensional regression problem of learning
a function from x to y. However, different from a standard
regression setup, paired training samples are not
available; instead, only unpaired samples $\{x_i\}_{i=1}^n$ and
$\{y_i\}_{i=1}^n$ are given.

By CDOM, we first match $\{x_i\}_{i=1}^n$ and $\{y_i\}_{i=1}^n$, and
then we train a multi-dimensional kernel regression
model (Schölkopf and Smola, 2002) using the matched
samples $\{(x_{\pi(i)}, y_i)\}_{i=1}^n$ as

$$\min_{W} \sum_{i=1}^n \| y_i - W^\top k(x_{\pi(i)}) \|^2 + \frac{\delta}{2} \mathrm{tr}(W^\top W),$$

where

$$k(x) = (K(x, x_{\pi(1)}), \ldots, K(x, x_{\pi(n)}))^\top,$$

$$K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2 \tau^2} \right).$$

Here, $\tau$ is a Gaussian kernel width and $\delta$ is a regularization
parameter; they are chosen by 2-fold CV.

We repeat the experiments 100 times by randomly
shuffling training and test samples, and evaluate the
voice conversion performance by log-spectral distance
for 8000 test samples¹ (Quackenbush et al., 1988). Figure
3 shows the true spectral envelopes and their estimates,
and Figure 4 shows the average performance
over 100 runs as a function of the number of training
samples. These results show that the proposed LSOM
tends to outperform KS-NOCCO and KS-HSIC.

¹ The smaller the spectral distortion is, the better the
quality of voice conversion is.

Figure 4: Unpaired voice conversion results. The best
method in terms of the mean spectral distortion and
comparable methods according to the t-test at the
significance level 1% are specified by 'o'.

5.3 Photo Album Summarization

Finally,  we  apply  the  proposed  LSOM  method  to  a 
photo  album  summarization  problem,  where  photos 
are  automatically  aligned  into  a  designed  frame  ex¬ 
pressed  in  the  Cartesian  coordinate  system. 

First,  we  use  320  images  in  the  RGB  format  obtained 
from  Flickr2.  We  consider  a  rectangular  frame  of 
16  x  20  (=  320),  and  arrange  the  images  in  this  rect¬ 
angular  frame.  Figure  5(a)  depicts  the  photo  album 
summarization  result,  showing  that  images  are  aligned 
in  the  way  that  images  with  similar  colors  are  aligned 
closely. 

Similarly, we use the Frey face dataset (Roweis and
Saul, 2000), which consists of 225 gray-scale face images
with 28 x 20 (= 560) pixels. We similarly convert
an image into a 560-dimensional vector, and we set the
grid size to 15 x 15 (= 225). The results depicted in
Figure 5(b) show that similar face images (in terms of
the angle and facial expressions) are assigned to nearby
cells in the grid.

² http://www.flickr.com

Makoto Yamada, Masashi Sugiyama

Figure 5: Images are automatically aligned into rectangular grid frames
expressed in the Cartesian coordinate system. (a) Layout of 320 images
into a 2D grid of size 16 by 20 using LSOM. (b) Layout of 225 facial
images into a 2D grid of size 15 by 15 using LSOM. (c) Layout of 320
digit '7' into a 2D grid of size 16 by 20 using LSOM.

Figure 6: Images are automatically aligned into complex grid frames
expressed in the Cartesian coordinate system. (a) Layout of 120 images
into a Japanese character 'mountain' by LSOM. (b) Layout of 153 facial
images into 'smiley' by LSOM. (c) Layout of 199 digit '7' into '777' by
LSOM.

Next, we apply LSOM to the USPS hand-written digit
dataset (Hastie et al., 2001). In this experiment, we
use  320  gray-scale  images  of  digit  ‘7’  with  16  x  16 
(=  256)  pixels.  We  convert  an  image  into  a  256- 
dimensional  vector,  and  we  set  the  grid  size  to  16  x  20 
(=  320).  The  result  depicted  in  Figure  5(c)  shows  that 
digits  with  similar  profiles  are  aligned  closely. 

Finally, we align the Flickr, Frey face, and USPS images
into more complex frames: a Japanese character
'mountain', a smiley-face shape, and a '777' digit
shape. The results depicted in Figure 6 show that images
with similar profiles are located in nearby
grid-coordinate cells.


6  Conclusion 

In  this  paper,  we  proposed  two  methods  of  cross¬ 
domain  object  matching  (CDOM).  The  first  method 
uses  the  dependence  measure  based  on  the  normalized 
cross-covariance  operator  (NOCCO),  which  is  advan¬ 
tageous  over  HSIC  in  that  NOCCO  is  asymptotically 
independent  of  the  choice  of  kernels.  However,  with 
finite  samples,  it  still  depends  on  kernels  which  need 
to  be  manually  tuned.  To  cope  with  this  problem, 
we  proposed  a  more  practical  CDOM  approach  called 
least-squares  object  matching  (LSOM).  LSOM  adopts 
squared-loss  mutual  information  as  a  dependence  mea¬ 
sure,  and  it  is  estimated  by  the  method  of  least-squares 
mutual  information  (LSMI).  A  notable  advantage  of 
the  LSOM  method  is  that  it  is  equipped  with  a  natural 
cross-validation  procedure  that  allows  us  to  objectively 
optimize  tuning  parameters  such  as  the  Gaussian  ker¬ 
nel  width  and  the  regularization  parameter  in  a  data- 
dependent  fashion.  We  applied  the  proposed  methods 
to  the  image  matching,  unpaired  voice  conversion,  and 
photo  album  summarization  tasks,  and  experimentally 
showed that LSOM is the most promising.

Appendix

SMI cannot be directly computed since it contains unknown
densities p(x, y), p(x), and p(y). Here, we
briefly review an SMI estimator called least-squares
mutual information (LSMI) (Suzuki et al., 2009).

Suppose that we are given n independent and identically
distributed (i.i.d.) paired samples $\{(x_i, y_i)\}_{i=1}^n$
drawn from a joint distribution with density p(x, y).
A key idea of LSMI is to directly estimate the density
ratio:

$$w(x, y) = \frac{p(x, y)}{p(x) p(y)},$$

without going through density estimation of p(x, y),
p(x), and p(y).

In LSMI, the density ratio function w(x, y) is directly
modeled by the following linear model:

$$w_\alpha(x, y) = \sum_{\ell=1}^b \alpha_\ell \varphi_\ell(x, y) = \alpha^\top \varphi(x, y), \qquad (11)$$

where b is the number of basis functions, $\alpha =
(\alpha_1, \ldots, \alpha_b)^\top$ are parameters, and $\varphi(x, y) =
(\varphi_1(x, y), \ldots, \varphi_b(x, y))^\top$ are basis functions. Note
that we set b = n in this paper.

The parameter $\alpha$ in the model $w_\alpha(x, y)$ is learned so
that the squared error between w(x, y) and $w_\alpha(x, y)$
is minimized; this is formulated as

$$\widehat{\alpha} = \operatorname*{argmin}_{\alpha} \left[ \frac{1}{2} \alpha^\top \widehat{H} \alpha - \widehat{h}^\top \alpha + \frac{\lambda}{2} \alpha^\top \alpha \right],$$

where a regularization term $\frac{\lambda}{2} \alpha^\top \alpha$ is included for
avoiding overfitting, and

$$\widehat{H} = \frac{1}{n^2} \sum_{i,j=1}^n \varphi(x_i, y_j) \varphi(x_i, y_j)^\top,$$

$$\widehat{h} = \frac{1}{n} \sum_{i=1}^n \varphi(x_i, y_i).$$

With the product-kernel basis functions
$\varphi_\ell(x, y) = K(x, x_\ell) L(y, y_\ell)$, which the matrix expressions
below presuppose, $\widehat{H}$ and $\widehat{h}$ can be rewritten as
(Petersen and Pedersen, 2008)

$$\widehat{H} = \frac{1}{n^2} (K K^\top) \circ (L L^\top),$$

$$\widehat{h} = \frac{1}{n} (K \circ L) 1_n.$$

Differentiating the above objective function with respect
to $\alpha$ and equating it to zero, we can obtain an
analytic-form solution:

$$\widehat{\alpha} = (\widehat{H} + \lambda I_b)^{-1} \widehat{h}.$$

Given a density ratio estimator $\widehat{w} = w_{\widehat{\alpha}}$, SMI can be
simply approximated as

$$\mathrm{LSMI}(Z) = \frac{1}{2} \widehat{\alpha}^\top \widehat{h} - \frac{1}{2}.$$

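In order to determine the kernel parameter and the
regularization parameter $\lambda$, cross-validation (CV) is
available for the LSMI estimator: First, the samples
$\{(x_i, y_i)\}_{i=1}^n$ are divided into K disjoint subsets
$\{S_k\}_{k=1}^K$, $S_k = \{(x_{k,i}, y_{k,i})\}_{i=1}^{n_k}$, of (approximately) the
same size, where $n_k$ is the number of samples in the
subset $S_k$. Then, an estimator $\widehat{\alpha}_{S_k}$ is obtained using
$\{S_j\}_{j \neq k}$, and the approximation error for the hold-out
samples $S_k$ is computed as

$$J_{S_k}^{(K\text{-CV})} = \frac{1}{2} \widehat{\alpha}_{S_k}^\top \widehat{H}_{S_k} \widehat{\alpha}_{S_k} - \widehat{h}_{S_k}^\top \widehat{\alpha}_{S_k},$$

where, for $[K_{S_k}]_{ij} = K(x_i, x_{k,j})$ and $[L_{S_k}]_{ij} = L(y_i, y_{k,j})$,
$i = 1, \ldots, n$, $j = 1, \ldots, |S_k|$,

$$\widehat{H}_{S_k} = \frac{1}{n_k^2} (K_{S_k} K_{S_k}^\top) \circ (L_{S_k} L_{S_k}^\top),$$

$$\widehat{h}_{S_k} = \frac{1}{n_k} (K_{S_k} \circ L_{S_k}) 1_{n_k}.$$

This procedure is repeated for $k = 1, \ldots, K$, and its
average $J^{(K\text{-CV})}$ is outputted as

$$J^{(K\text{-CV})} = \frac{1}{K} \sum_{k=1}^K J_{S_k}^{(K\text{-CV})}.$$

We compute $J^{(K\text{-CV})}$ for all model candidates, and
choose the model that minimizes $J^{(K\text{-CV})}$.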
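
The cross-validation procedure can be sketched as below. The sketch assumes that the hold-out error is the unregularized least-squares objective, one half of the quadratic form in the held-out H minus the inner product with the held-out h, and that basis functions are centered at all n samples while H and h are averaged over one subset at a time. The function names, the toy data, and the candidate grids are hypothetical.

```python
import numpy as np

def gaussian_kernel(a, sigma):
    sq = np.sum((a[:, None, :] - a[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def lsmi_stats(Ks, Ls):
    """H and h averaged over the samples indexed by the columns of Ks, Ls."""
    m = Ks.shape[1]
    H = (Ks @ Ks.T) * (Ls @ Ls.T) / m ** 2
    h = (Ks * Ls).sum(axis=1) / m
    return H, h

def lsmi_cv_score(K, L, lam, n_folds=2, seed=0):
    """Average hold-out error over the folds (smaller is better)."""
    n = K.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    scores = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        H_tr, h_tr = lsmi_stats(K[:, train], L[:, train])
        alpha = np.linalg.solve(H_tr + lam * np.eye(n), h_tr)
        H_te, h_te = lsmi_stats(K[:, test], L[:, test])
        scores.append(0.5 * alpha @ H_te @ alpha - h_te @ alpha)
    return float(np.mean(scores))

def select_model(x, y, width_pairs, lams):
    """Return (score, (sigma_x, sigma_y), lam) with the smallest CV score."""
    best = None
    for sx, sy in width_pairs:
        K, L = gaussian_kernel(x, sx), gaussian_kernel(y, sy)
        for lam in lams:
            score = lsmi_cv_score(K, L, lam)
            if best is None or score < best[0]:
                best = (score, (sx, sy), lam)
    return best

rng = np.random.default_rng(0)
x = rng.normal(size=(40, 1))
y = np.sin(x) + 0.1 * rng.normal(size=(40, 1))
width_pairs = [(0.3, 0.3), (1.0, 1.0), (3.0, 3.0)]
lams = [1e-1, 1e-2, 1e-3]
best_score, best_widths, best_lam = select_model(x, y, width_pairs, lams)
```

Fixing the random seed inside `lsmi_cv_score` keeps the fold split identical across candidates, so the scores being compared differ only through the model parameters.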

Abstract

Information-maximization  clustering  learns 
a  probabilistic  classifier  in  an  unsupervised 
manner  so  that  mutual  information  between 
feature  vectors  and  cluster  assignments  is 
maximized.  A  notable  advantage  of  this 
approach  is  that  it  only  involves  continu¬ 
ous  optimization  of  model  parameters,  which 
is  substantially  easier  to  solve  than  dis¬ 
crete  optimization  of  cluster  assignments. 
However,  existing  methods  still  involve  non- 
convex  optimization  problems,  and  there¬ 
fore  finding  a  good  local  optimal  solution  is 
not  straightforward  in  practice.  In  this  pa¬ 
per,  we  propose  an  alternative  information- 
maximization  clustering  method  based  on  a 
squared-loss  variant  of  mutual  information. 

This  novel  approach  gives  a  clustering  so¬ 
lution  analytically  in  a  computationally  ef¬ 
ficient  way  via  kernel  eigenvalue  decompo¬ 
sition.  Furthermore,  we  provide  a  practical 
model  selection  procedure  that  allows  us  to 
objectively  optimize  tuning  parameters  in¬ 
cluded  in  the  kernel  function.  Through  ex¬ 
periments,  we  demonstrate  the  usefulness  of 
the  proposed  approach. 

1.  Introduction 

The  goal  of  clustering  is  to  classify  data  samples  into 
disjoint  groups  in  an  unsupervised  manner.  K-means 
is  a  classic  but  still  popular  clustering  algorithm.  How¬ 
ever,  since  k-means  only  produces  linearly  separated 
clusters, its usefulness is rather limited in practice.

To cope with this problem, various non-linear clustering
methods have been developed. Kernel k-means
(Girolami,  2002)  performs  k-means  in  a  feature  space 
induced  by  a  reproducing  kernel  function.  Spectral 
clustering  (Shi  &  Malik,  2000)  first  unfolds  non-linear 
data  manifolds  by  a  spectral  embedding  method,  and 
then  performs  k-means  in  the  embedded  space.  Blur¬ 
ring  mean-shift  (Fukunaga  &  Hostetler,  1975)  uses 
a  non-parametric  kernel  density  estimator  for  model¬ 
ing  the  data-generating  probability  density  and  finds 
clusters based on the modes of the estimated density.
Discriminative clustering (Xu et al., 2005; Bach
& Harchaoui, 2008) learns a discriminative classifier
for  separating  clusters,  where  class  labels  are  also 
treated  as  parameters  to  be  optimized.  Dependence- 
maximization  clustering  (Song  et  al.,  2007;  Faivi- 
shevsky  &  Goldberger,  2010)  determines  cluster  as¬ 
signments  so  that  their  dependence  on  input  data  is 
maximized. 

These  non-linear  clustering  techniques  would  be  capa¬ 
ble  of  handling  highly  complex  real-world  data.  How¬ 
ever,  they  suffer  from  lack  of  objective  model  selection 
strategies1.  More  specifically,  the  above  non-linear 
clustering  methods  contain  tuning  parameters  such  as 
the  width  of  Gaussian  functions  and  the  number  of 
nearest  neighbors  in  kernel  functions  or  similarity  mea¬ 
sures,  and  these  tuning  parameter  values  need  to  be 
heuristically  determined  in  an  unsupervised  manner. 
The  problem  of  learning  similarities/kernels  was  ad¬ 
dressed  in  earlier  works,  but  they  considered  super¬ 
vised  setups,  i.e.,  labeled  samples  are  assumed  to  be 
given. Zelnik-Manor & Perona (2005) provided a useful
unsupervised heuristic to determine the similarity
in a data-dependent way. However, it still requires the
number of nearest neighbors to be determined manually
(although the magic number ‘7’ was shown to
work well in their experiments).

1 ‘Model selection’ in this paper refers to the choice of
tuning parameters in kernel functions or similarity measures,
not the choice of the number of clusters.

Another line of clustering framework called
information-maximization clustering (Agakov &
Barber, 2006; Gomes et al., 2010) exhibited the
state-of-the-art performance. In this information-maximization
approach, probabilistic classifiers such
as a kernelized Gaussian classifier (Agakov & Barber,
2006) and a kernel logistic regression classifier (Gomes
et al., 2010) are learned so that mutual information
(MI)  between  feature  vectors  and  cluster  assignments 
is  maximized  in  an  unsupervised  manner.  A  notable 
advantage  of  this  approach  is  that  classifier  training 
is  formulated  as  continuous  optimization  problems, 
which  are  substantially  simpler  than  discrete  opti¬ 
mization  of  cluster  assignments.  Indeed,  classifier 
training  can  be  carried  out  in  computationally  ef¬ 
ficient  manners  by  a  gradient  method  (Agakov  & 
Barber,  2006)  or  a  quasi-Newton  method  (Gomes 
et  al.,  2010).  Furthermore,  Agakov  &  Barber  (2006) 
provided  a  model  selection  strategy  based  on  the 
common  information-maximization  principle.  Thus, 
kernel  parameters  can  be  systematically  optimized  in 
an  unsupervised  way. 

However, in the above MI-based clustering approach,
the optimization problems are non-convex, and finding
a good local optimal solution is not straightforward
in practice. The goal of this paper is to overcome
this problem by providing a novel information-maximization
clustering method. More specifically,
we propose to employ a variant of MI called squared-loss
MI (SMI), and develop a new clustering algorithm
whose solution can be computed analytically in
a computationally efficient way via eigenvalue decomposition.
Furthermore, for kernel parameter optimization,
we propose to use a non-parametric SMI estimator
called least-squares MI (LSMI) (Suzuki et al.,
2009), which was proved to achieve the optimal convergence
rate with analytic-form solutions. Through
experiments on various real-world datasets such as images,
natural languages, accelerometric sensors, and
speech, we demonstrate the usefulness of the proposed
clustering method.


2. Information-Maximization Clustering with Squared-Loss Mutual Information

In this section, we describe our novel clustering algorithm.

2.1. Formulation of Information-Maximization Clustering

Suppose we are given d-dimensional i.i.d. feature vectors of size n,

$$\{x_i \mid x_i \in \mathbb{R}^d\}_{i=1}^{n},$$

which are assumed to be drawn independently from a distribution with
density $p^*(x)$. The goal of clustering is to give cluster assignments,

$$\{y_i \mid y_i \in \{1, \ldots, c\}\}_{i=1}^{n},$$

to the feature vectors $\{x_i\}_{i=1}^{n}$, where c denotes the number of
classes. Throughout this paper, we assume that c is known.

In order to solve the clustering problem, we take the
information-maximization approach (Agakov & Barber, 2006; Gomes et al.,
2010). That is, we regard clustering as an unsupervised classification
problem, and learn the class-posterior probability $p^*(y|x)$ so that
‘information’ between feature vector x and class label y is maximized.

The dependence-maximization approach (Song et al., 2007; Faivishevsky &
Goldberger, 2010) is related to, but substantially different from, the
above information-maximization approach. In the dependence-maximization
approach, cluster assignments $\{y_i\}_{i=1}^{n}$ are directly determined
so that their dependence on feature vectors $\{x_i\}_{i=1}^{n}$ is
maximized. Thus, the dependence-maximization approach intrinsically
involves combinatorial optimization with respect to $\{y_i\}_{i=1}^{n}$.
On the other hand, the information-maximization approach involves
continuous optimization with respect to the parameter $\alpha$ included in
a class-posterior model $p(y|x;\alpha)$. This continuous optimization of
$\alpha$ is substantially easier to solve than discrete optimization of
$\{y_i\}_{i=1}^{n}$.

Another advantage of the information-maximization approach is that it
naturally allows out-of-sample clustering based on the discriminative
model $p(y|x;\alpha)$, i.e., a cluster assignment for a new feature vector
can be obtained based on the learned discriminative model.


2.2.  Squared-Loss  Mutual  Information 

As an information measure, we adopt squared-loss mutual information
(SMI). SMI between feature vector x and class label y is defined by

$$\mathrm{SMI} := \frac{1}{2} \int \sum_{y=1}^{c} p^*(x)\,p^*(y) \left( \frac{p^*(x,y)}{p^*(x)\,p^*(y)} - 1 \right)^{2} \mathrm{d}x, \qquad (1)$$

where $p^*(x,y)$ denotes the joint density of x and y, and $p^*(y)$ is the
marginal probability of y. SMI is the Pearson divergence (Pearson, 1900)
from $p^*(x,y)$ to $p^*(x)p^*(y)$, while the ordinary MI (Cover & Thomas,
2006) is the Kullback-Leibler divergence (Kullback & Leibler, 1951) from
$p^*(x,y)$ to $p^*(x)p^*(y)$:

$$\mathrm{MI} := \int \sum_{y=1}^{c} p^*(x,y) \log \frac{p^*(x,y)}{p^*(x)\,p^*(y)} \, \mathrm{d}x. \qquad (2)$$

The Pearson divergence and the Kullback-Leibler divergence both belong to
the class of Ali-Silvey-Csiszár divergences (also known as f-divergences;
see Ali & Silvey, 1966; Csiszár, 1967), and thus they share similar
properties. For example, SMI is non-negative and takes zero if and only
if x and y are statistically independent, as is the case for the
ordinary MI.
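To make the two definitions concrete, the following minimal sketch evaluates SMI and MI for a small joint probability table. It assumes, purely for illustration, that x is discrete so that the integral over x becomes a sum, and that all marginal probabilities are positive; the function name `smi_and_mi` is hypothetical and not part of the original method.

```python
import numpy as np

def smi_and_mi(joint):
    """Return (SMI, MI) for a joint probability table joint[x, y].

    Illustrative assumption: x is discrete, so the integral over x in
    Eqs.(1) and (2) is replaced by a sum. All marginals must be positive.
    """
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal of x
    py = joint.sum(axis=0, keepdims=True)   # marginal of y
    prod = px * py                          # product of marginals
    ratio = joint / prod                    # density ratio p(x,y) / (p(x)p(y))
    smi = 0.5 * np.sum(prod * (ratio - 1.0) ** 2)
    mask = joint > 0                        # 0 * log(0) is treated as 0
    mi = np.sum(joint[mask] * np.log(ratio[mask]))
    return smi, mi

independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.5, 0.0], [0.0, 0.5]]
print(smi_and_mi(independent))
print(smi_and_mi(dependent))
```

For the independent table both measures vanish, while for the fully dependent table SMI equals 1/2 and MI equals log 2, in line with the shared property stated above.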

In the existing information-maximization clustering methods (Agakov &
Barber, 2006; Gomes et al., 2010), MI is used as the information measure.
On the other hand, in this paper, we adopt SMI because it allows us to
develop a clustering algorithm whose solution can be computed
analytically in a computationally efficient way via eigenvalue
decomposition, as described below.


2.3.  Clustering  by  SMI  Maximization 

Here, we give a computationally efficient clustering algorithm based on
SMI (1).

We can express SMI as

$$\mathrm{SMI} = \frac{1}{2} \int \sum_{y=1}^{c} p^*(x,y) \frac{p^*(x,y)}{p^*(x)\,p^*(y)} \, \mathrm{d}x - \frac{1}{2} \qquad (3)$$

$$\phantom{\mathrm{SMI}} = \frac{1}{2} \int \sum_{y=1}^{c} p^*(y|x)\, p^*(x) \frac{p^*(y|x)}{p^*(y)} \, \mathrm{d}x - \frac{1}{2}. \qquad (4)$$

Suppose that the class-prior probability $p^*(y)$ is set to be uniform:
$p^*(y) = 1/c$. Then Eq.(4) is expressed as

$$\frac{c}{2} \int \sum_{y=1}^{c} p^*(y|x)\, p^*(x)\, p^*(y|x) \, \mathrm{d}x - \frac{1}{2}. \qquad (5)$$

Let us approximate the class-posterior probability $p^*(y|x)$ by the
following kernel model:

$$p(y|x;\alpha) := \sum_{i=1}^{n} \alpha_{y,i} K(x, x_i), \qquad (6)$$

where $K(x,x')$ denotes a kernel function with a kernel parameter t. In
the experiments, we will use a sparse variant of the local-scaling kernel
(Zelnik-Manor & Perona, 2005):

$$K(x_i, x_j) = \begin{cases} \exp\left( -\dfrac{\|x_i - x_j\|^2}{2\sigma_i\sigma_j} \right) & \text{if } x_i \in \mathcal{N}_t(x_j) \text{ or } x_j \in \mathcal{N}_t(x_i), \\ 0 & \text{otherwise,} \end{cases} \qquad (7)$$

where $\mathcal{N}_t(x)$ denotes the set of t nearest neighbors for x (t is
the kernel parameter), $\sigma_i$ is a local scaling factor defined as
$\sigma_i = \|x_i - x_i^{(t)}\|$, and $x_i^{(t)}$ is the t-th nearest
neighbor of $x_i$.

Further approximating the expectation with respect to $p^*(x)$ included
in Eq.(5) by the empirical average of samples $\{x_i\}_{i=1}^{n}$, we
arrive at the following SMI approximator:

$$\widehat{\mathrm{SMI}} := \frac{c}{2n} \sum_{y=1}^{c} \alpha_y^\top K^2 \alpha_y - \frac{1}{2}, \qquad (8)$$

where $\top$ denotes the transpose, $\alpha_y := (\alpha_{y,1}, \ldots,
\alpha_{y,n})^\top$, and $K_{i,j} := K(x_i, x_j)$.

For each cluster y, we maximize $\alpha_y^\top K^2 \alpha_y$ under
$\|\alpha_y\| = 1$ (see footnote 2). Since this is the Rayleigh quotient,
the maximizer is given by the normalized principal eigenvector of K (Horn
& Johnson, 1985). To prevent all the solutions $\{\alpha_y\}_{y=1}^{c}$
from being reduced to the same principal eigenvector, we impose their
mutual orthogonality: $\alpha_y^\top \alpha_{y'} = 0$ for $y \neq y'$.
Then the solutions are given by the normalized eigenvectors $\phi_1,
\ldots, \phi_c$ associated with the c largest of the eigenvalues
$\lambda_1 \geq \cdots \geq \lambda_n \geq 0$ of K. Since the sign of
$\phi_y$ is arbitrary, we set the sign as

$$\tilde{\phi}_y = \phi_y \times \mathrm{sign}(\phi_y^\top 1_n),$$

where sign(·) denotes the sign of a scalar and $1_n$ denotes the
n-dimensional vector with all ones.

On the other hand, since

$$p^*(y) = \int p^*(y|x)\, p^*(x)\, \mathrm{d}x \approx \frac{1}{n} \sum_{i=1}^{n} p(y|x_i;\alpha) = \frac{1}{n} \alpha_y^\top K 1_n,$$

and the class-prior probability $p^*(y)$ was set to be uniform, we have
the following normalization condition:

$$\frac{1}{n} \alpha_y^\top K 1_n = \frac{1}{c}.$$

Furthermore, probability estimates should be non-negative, which can be
achieved by rounding up negative outputs to zero. Taking these issues
into account, cluster assignments $\{y_i\}_{i=1}^{n}$ for
$\{x_i\}_{i=1}^{n}$ are determined as

$$y_i = \mathop{\mathrm{argmax}}_{y} \frac{[\max(0_n, \tilde{\phi}_y)]_i}{\max(0_n, \tilde{\phi}_y)^\top 1_n},$$

where the max operation for vectors is applied in the element-wise manner
and $[\cdot]_i$ denotes the i-th element of a vector. Note that we used
$K \tilde{\phi}_y = \lambda_y \tilde{\phi}_y$ in the above derivation.

We call the above method SMI-based clustering (SMIC).

Footnote 2: Note that this unit-norm constraint is not essential since
the obtained solution is renormalized later.
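The procedure of this subsection can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' MATLAB implementation: the function names `local_scaling_kernel` and `smic` are hypothetical, the data points are assumed to be distinct with t < n, each point is treated as its own neighbor so that the diagonal of K is one (an assumption that only shifts the eigenvalues and leaves the eigenvectors unchanged), and the local-scaling exponent is written with the factor 2 in the denominator. Cluster labels are returned as 0, ..., c-1.

```python
import numpy as np

def local_scaling_kernel(X, t):
    """Sparse local-scaling kernel matrix (assumes distinct points, t < n)."""
    n = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)  # squared distances
    order = np.argsort(sq, axis=1)                 # column 0 is the point itself
    sigma = np.sqrt(sq[np.arange(n), order[:, t]]) # distance to t-th neighbor
    K = np.exp(-sq / (2.0 * np.outer(sigma, sigma)))
    near = np.zeros((n, n), dtype=bool)
    near[np.repeat(np.arange(n), t), order[:, 1:t + 1].ravel()] = True
    np.fill_diagonal(near, True)                   # assumption: K(x_i, x_i) = 1
    return np.where(near | near.T, K, 0.0)         # keep entry if either is a neighbor

def smic(K, c):
    """Cluster labels in {0, ..., c-1} from a symmetric kernel matrix K."""
    _, vecs = np.linalg.eigh(K)                    # eigenvalues in ascending order
    phi = vecs[:, ::-1][:, :c]                     # c leading eigenvectors
    signs = np.sign(phi.sum(axis=0))               # sign(phi_y^T 1_n)
    signs[signs == 0] = 1.0
    phi = phi * signs
    pos = np.maximum(phi, 0.0)                     # round negative outputs up to zero
    scores = pos / pos.sum(axis=0)                 # renormalize each cluster's column
    return np.argmax(scores, axis=1)

# Toy kernel matrix with two obvious groups of sizes 3 and 2.
K_toy = np.array([[1, 1, 1, 0, 0],
                  [1, 1, 1, 0, 0],
                  [1, 1, 1, 0, 0],
                  [0, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1]], dtype=float)
print(smic(K_toy, 2))
```

No iterative optimization or initialization appears anywhere in the sketch: the only non-trivial step is one symmetric eigenvalue decomposition, which is the source of the computational efficiency claimed for SMIC.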


2.4. Kernel Parameter Choice by SMI Maximization

Since the above clustering approach was developed in the framework of SMI
maximization, it would be natural to determine the kernel parameters so
that SMI is maximized. A direct approach is to use the above SMI
estimator $\widehat{\mathrm{SMI}}$ also for kernel parameter choice.
However, this direct approach is not favorable because
$\widehat{\mathrm{SMI}}$ is an unsupervised SMI estimator (i.e., SMI is
estimated only from unlabeled samples $\{x_i\}_{i=1}^{n}$). In the model
selection stage, however, we have already obtained labeled samples
$\{(x_i, y_i)\}_{i=1}^{n}$, and thus supervised estimation of SMI is
possible. For supervised SMI estimation, a non-parametric SMI estimator
called least-squares mutual information (LSMI) (Suzuki et al., 2009) was
shown to achieve the optimal convergence rate. For this reason, we
propose to use LSMI for model selection, instead of
$\widehat{\mathrm{SMI}}$ (8).

LSMI is an estimator of SMI based on paired samples
$\{(x_i, y_i)\}_{i=1}^{n}$. The key idea of LSMI is to learn the following
density-ratio function,

$$r^*(x,y) := \frac{p^*(x,y)}{p^*(x)\,p^*(y)}, \qquad (9)$$

without going through density estimation of $p^*(x,y)$, $p^*(x)$, and
$p^*(y)$. More specifically, let us employ the following density-ratio
model:

$$r(x,y;\theta) := \sum_{\ell:\, y_\ell = y} \theta_\ell L(x, x_\ell), \qquad (10)$$

where $L(x,x')$ is a kernel function with kernel parameter $\gamma$. In
the experiments, we will use the Gaussian kernel:

$$L(x,x') = \exp\left( -\frac{\|x - x'\|^2}{2\gamma^2} \right). \qquad (11)$$

The parameter $\theta$ in the above density-ratio model is learned so that
the following squared error is minimized:

$$\frac{1}{2} \int \sum_{y=1}^{c} \left( r(x,y;\theta) - r^*(x,y) \right)^2 p^*(x)\,p^*(y) \, \mathrm{d}x. \qquad (12)$$

Among n cluster assignments $\{y_i\}_{i=1}^{n}$, let $n_y$ be the number
of samples in cluster y. Let $\theta_y$ be the parameter vector
corresponding to the kernel bases $\{L(x, x_\ell)\}_{\ell:\, y_\ell = y}$,
i.e., $\theta_y$ is the $n_y$-dimensional sub-vector of $\theta =
(\theta_1, \ldots, \theta_n)^\top$ consisting of indices $\{\ell \mid
y_\ell = y\}$. Then an empirical and regularized version of the
optimization problem (12) is given for each y as follows:

$$\min_{\theta_y} \left[ \frac{1}{2} \theta_y^\top \widehat{H}^{(y)} \theta_y - \theta_y^\top \widehat{h}^{(y)} + \frac{\delta}{2} \theta_y^\top \theta_y \right], \qquad (13)$$

where $\delta$ ($\geq 0$) is the regularization parameter.
$\widehat{H}^{(y)}$ is the $n_y \times n_y$ matrix and $\widehat{h}^{(y)}$
is the $n_y$-dimensional vector defined as

$$\widehat{H}^{(y)}_{\ell,\ell'} := \frac{n_y}{n^2} \sum_{i=1}^{n} L(x_i, x_\ell^{(y)})\, L(x_i, x_{\ell'}^{(y)}),$$

$$\widehat{h}^{(y)}_{\ell} := \frac{1}{n} \sum_{i:\, y_i = y} L(x_i, x_\ell^{(y)}),$$

where $x_\ell^{(y)}$ is the $\ell$-th sample in class y (which corresponds
to $\widehat{\theta}_\ell^{(y)}$).

A notable advantage of LSMI is that the solution $\widehat{\theta}^{(y)}$
can be computed analytically as

$$\widehat{\theta}^{(y)} = \left( \widehat{H}^{(y)} + \delta I \right)^{-1} \widehat{h}^{(y)}.$$

Then a density-ratio estimator is obtained analytically as follows:

$$\widehat{r}(x,y) = \sum_{\ell=1}^{n_y} \widehat{\theta}_\ell^{(y)} L(x, x_\ell^{(y)}).$$

The accuracy of the above least-squares density-ratio estimator depends
on the choice of the kernel parameter $\gamma$ and the regularization
parameter $\delta$. They can be systematically optimized based on
cross-validation as follows (Suzuki et al., 2009). The samples
$\mathcal{Z} = \{(x_i, y_i)\}_{i=1}^{n}$ are divided into M disjoint
subsets $\{\mathcal{Z}_m\}_{m=1}^{M}$ of approximately the same size. Then
a density-ratio estimator $\widehat{r}_m(x,y)$ is obtained using
$\mathcal{Z} \setminus \mathcal{Z}_m$ (i.e., all samples without
$\mathcal{Z}_m$), and its out-of-sample error (which corresponds to
Eq.(12) without irrelevant constant) for the hold-out samples
$\mathcal{Z}_m$ is computed as

$$\mathrm{CV}_m := \frac{1}{2|\mathcal{Z}_m|^2} \sum_{x,y \in \mathcal{Z}_m} \widehat{r}_m(x,y)^2 - \frac{1}{|\mathcal{Z}_m|} \sum_{(x,y) \in \mathcal{Z}_m} \widehat{r}_m(x,y),$$

where the first sum runs over all combinations of an x and a y appearing
in $\mathcal{Z}_m$ and the second sum runs over the paired samples. This
procedure is repeated for m = 1, ..., M, and the average of the above
hold-out error over all m is computed. Finally, the kernel parameter
$\gamma$ and the regularization parameter $\delta$ that minimize the
average hold-out error are chosen as the most suitable ones.

Based on the expression of SMI given by Eq.(3), an SMI estimator called
LSMI is given as follows:

$$\mathrm{LSMI} := \frac{1}{2n} \sum_{i=1}^{n} \widehat{r}(x_i, y_i) - \frac{1}{2}, \qquad (14)$$

where $\widehat{r}(x,y)$ is a density-ratio estimator obtained above.
Since $\widehat{r}(x,y)$ can be computed analytically, LSMI can also be
computed analytically.
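As a rough illustration of why LSMI is cheap to evaluate, the class-wise analytic solution and the final estimator can be sketched as follows. The function names `gaussian_kernel` and `lsmi` are hypothetical, the kernel width gamma and the regularization parameter delta are passed in by hand (the cross-validation loop described above is omitted), and a ridge term delta * I is added as in the analytic solution.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / (2.0 * gamma ** 2))

def lsmi(X, y, gamma, delta):
    """LSMI estimate from paired samples (X[i], y[i]); a sketch only."""
    n = X.shape[0]
    total = 0.0
    for label in np.unique(y):
        in_class = (y == label)
        centers = X[in_class]                       # kernel centers of this class
        n_y = centers.shape[0]
        Phi = gaussian_kernel(X, centers, gamma)    # shape (n, n_y)
        H = (n_y / n ** 2) * (Phi.T @ Phi)          # class-wise H matrix
        h = Phi[in_class].sum(axis=0) / n           # class-wise h vector
        theta = np.linalg.solve(H + delta * np.eye(n_y), h)
        total += np.sum(Phi[in_class] @ theta)      # ratio estimates at (x_i, y_i)
    return total / (2.0 * n) - 0.5

# Two well-separated groups whose labels match the groups exactly.
X_demo = np.array([[0.0], [0.5], [1.0], [100.0], [100.5], [101.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
print(lsmi(X_demo, y_demo, gamma=1.0, delta=1e-6))
```

In this toy case the labels are a deterministic function of the inputs and the two classes are balanced, so the estimate comes out close to 1/2, the SMI value of a fully dependent pair with two equally likely classes. Each class requires only one linear solve of size n_y, which is the class-wise optimization referred to later in the experiments.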

We use LSMI for model selection of SMIC. More specifically, we compute
LSMI as a function of the kernel parameter t of $K(x,x')$ included in the
cluster-posterior model (6), and choose the one that maximizes LSMI.

A MATLAB implementation of the proposed clustering method is available
from ‘http://sugiyama-www.cs.titech.ac.jp/~sugi/software/SMIC’.

3.  Existing  Methods 

In  this  section,  we  qualitatively  compare  the  proposed 
approach  with  existing  methods. 

3.1.  Spectral  Clustering 

The basic idea of spectral clustering (Shi & Malik, 2000) is to first
unfold non-linear data manifolds by a spectral embedding method, and then
perform k-means in the embedded space. More specifically, given
sample-sample similarity $W_{ij} \geq 0$, the minimizer of the following
criterion with respect to $\{\xi_i\}_{i=1}^{n}$ is obtained under some
normalization constraint:

$$\sum_{i,j=1}^{n} W_{ij} \left\| \frac{1}{\sqrt{D_{ii}}} \xi_i - \frac{1}{\sqrt{D_{jj}}} \xi_j \right\|^2,$$

where D is the diagonal matrix with i-th diagonal element given by
$D_{ii} := \sum_{j=1}^{n} W_{ij}$. Consequently, the embedded samples are
given by the principal eigenvectors of $D^{-1/2} W D^{-1/2}$, followed by
normalization. Note that spectral clustering was shown to be equivalent
to a weighted variant of kernel k-means with some specific kernel
(Dhillon et al., 2004).
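The embedding step described above can be sketched as follows. This is only an illustration: the function name `spectral_embedding` is hypothetical, every sample is assumed to have positive total similarity, and the final normalization is taken to be row-wise unit-length scaling, which is one common choice and an assumption here. The subsequent k-means step is omitted.

```python
import numpy as np

def spectral_embedding(W, k):
    """Embed n samples into k dimensions from a symmetric similarity matrix W."""
    d = W.sum(axis=1)                                    # diagonal of D
    d_inv_sqrt = 1.0 / np.sqrt(d)                        # assumes d > 0
    M = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]    # D^{-1/2} W D^{-1/2}
    _, vecs = np.linalg.eigh(M)                          # ascending eigenvalues
    Z = vecs[:, ::-1][:, :k]                             # k principal eigenvectors
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)  # row normalization

# Similarity matrix with two disconnected groups of sizes 3 and 2.
W_toy = np.array([[1, 1, 1, 0, 0],
                  [1, 1, 1, 0, 0],
                  [1, 1, 1, 0, 0],
                  [0, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1]], dtype=float)
print(spectral_embedding(W_toy, 2))
```

For this toy matrix, samples in the same group are mapped to the same point and the two groups are mapped to orthogonal directions, so k-means in the embedded space becomes trivial.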

The performance of spectral clustering depends heavily on the choice of
sample-sample similarity $W_{ij}$. Zelnik-Manor & Perona (2005) proposed a
useful unsupervised heuristic to determine the similarity in a
data-dependent manner, called local scaling:

$$W_{ij} = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma_i\sigma_j} \right),$$

where $\sigma_i$ is a local scaling factor defined as $\sigma_i = \|x_i -
x_i^{(t)}\|$, and $x_i^{(t)}$ is the t-th nearest neighbor of $x_i$. t is
the tuning parameter in the local scaling similarity, and t = 7 was shown
to be useful (Zelnik-Manor & Perona, 2005; Sugiyama, 2007). However, this
magic number ‘7’ does not always seem to work well in general.

If $D^{-1/2} W D^{-1/2}$ is regarded as a kernel matrix, spectral
clustering will be similar to the proposed SMIC method described in
Section 2.3. However, SMIC does not require the post k-means processing
since the principal components have a clear interpretation as parameter
estimates of the class-posterior model (6). Furthermore, our proposed
approach provides a systematic model selection strategy, which is a
notable advantage over spectral clustering.

3.2.  Blurring  Mean-Shift  Clustering 

Blurring  mean-shift  (Fukunaga  &  Hostetler,  1975)  is  a 
non-parametric  clustering  method  based  on  the  modes 
of  the  data-generating  probability  density. 

In the blurring mean-shift algorithm, a kernel density estimator
(Silverman, 1986) is used for modeling the data-generating probability
density:

$$\widehat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} K\left( \|x - x_i\|^2 / \sigma^2 \right),$$

where $K(\xi)$ is a kernel function such as a Gaussian kernel $K(\xi) =
e^{-\xi/2}$. Taking the derivative of $\widehat{p}(x)$ with respect to x
and equating the derivative at $x = x_i$ to zero, we obtain the following
updating formula for sample $x_i$ (i = 1, ..., n):

$$x_i \longleftarrow \frac{\sum_{j=1}^{n} W_{ij} x_j}{\sum_{j'=1}^{n} W_{ij'}},$$

where $W_{ij} := K'\left( \|x_i - x_j\|^2 / \sigma^2 \right)$ and
$K'(\xi)$ is the derivative of $K(\xi)$. Each mode of the density is
regarded as a representative of a cluster, and each data point is
assigned to the cluster to which it converges.
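The update rule above can be sketched as follows for the Gaussian kernel. For that kernel, $K'(\xi) = -\frac{1}{2} e^{-\xi/2}$, and the constant factor cancels between numerator and denominator, so the weights reduce to $\exp(-\|x_i - x_j\|^2 / (2\sigma^2))$. The function name `blurring_mean_shift` is hypothetical, and the fixed number of iterations is an assumption standing in for a proper stopping criterion.

```python
import numpy as np

def blurring_mean_shift(X, sigma, n_iter=10):
    """Run n_iter blurring mean-shift updates on the rows of X (Gaussian kernel)."""
    X = np.array(X, dtype=float)
    for _ in range(n_iter):
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
        W = np.exp(-sq / (2.0 * sigma ** 2))        # weights from current positions
        X = (W @ X) / W.sum(axis=1, keepdims=True)  # all samples move simultaneously
    return X

X_demo = np.array([[0.0], [0.2], [10.0], [10.2]])
print(blurring_mean_shift(X_demo, sigma=0.5, n_iter=20))
```

In this example the two nearby pairs collapse onto their respective means (about 0.1 and 10.1). Because the weights are recomputed from the moved samples at every iteration, the data set itself is progressively blurred, which distinguishes this variant from the ordinary mean-shift procedure.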

Carreira-Perpiñán (2007) showed that the blurring mean-shift algorithm
can be interpreted as an EM algorithm (Dempster et al., 1977), where
$W_{ij} / (\sum_{j'=1}^{n} W_{ij'})$ is regarded as the posterior
probability of the i-th sample belonging to the j-th cluster.
Furthermore, the above update rule can be expressed in a matrix form as
$X \longleftarrow XP$, where $X = (x_1, \ldots, x_n)$ is a sample matrix
and $P := W D^{-1}$ is a stochastic matrix of the random walk in a graph
with adjacency W (Chung, 1997). D is defined as $D_{ii} := \sum_{j=1}^{n}
W_{ij}$ and $D_{ij} = 0$ for $i \neq j$. If P is independent of X, the
above iterative algorithm corresponds to the power method (Golub & Loan,
1996) for finding the leading left eigenvector of P. Then, this algorithm
is closely related to spectral clustering, which computes the principal
eigenvectors of $D^{-1/2} W D^{-1/2}$ (see Section 3.1). Although P
depends on X in reality, Carreira-Perpiñán (2006) argued that this
analysis is still valid since P and X quickly reach a quasi-stable state.

An attractive property of blurring mean-shift is that the number of
clusters is automatically determined as the number of modes in the
probability density estimate. However, this choice depends on the kernel
parameter $\sigma$ and there is no systematic way to determine $\sigma$,
which is restrictive compared with the proposed method. Another critical
drawback of the blurring mean-shift algorithm is that it eventually
converges to a single point (Cheng, 1995), and therefore a sensible
stopping criterion is necessary in practice. Although Carreira-Perpiñán
(2006) gave a useful heuristic for stopping the iteration, it is not
clear whether this heuristic always works well in practice.

4.  Experiments 

In this section, we experimentally evaluate the performance of the
proposed and existing clustering methods.

4.1.  Illustration 

First,  we  illustrate  the  behavior  of  the  proposed 
method  using  artificial  datasets  described  in  the  top 
row  of  Figure  1.  The  dimensionality  is  d  =  2  and  the 
sample  size  is  n  =  200.  As  a  kernel  function,  we  used 
the  sparse  local-scaling  kernel  (7)  for  SMIC,  where  the 
kernel  parameter  t  was  chosen  from  {1, . . . ,  10}  based 
on  LSMI  with  the  Gaussian  kernel  (11). 

The top graphs in Figure 1 depict the cluster assignments obtained by
SMIC, and the bottom graphs in Figure 1 depict the model selection curves
obtained by LSMI. The results show that SMIC combined with LSMI works
well for these toy datasets.

4.2.  Performance  Comparison 

Next, we systematically compare the performance of the proposed and
existing clustering methods using various real-world datasets such as
images, natural languages, accelerometric sensors, and speech.

We compared the performance of the following methods, none of which
contains open tuning parameters, so that the experimental results are
fair and objective: K-means (KM), spectral clustering with the
self-tuning local-scaling similarity (SC) (Zelnik-Manor & Perona, 2005),
mean nearest-neighbor clustering (MNN) (Faivishevsky & Goldberger, 2010),
MI-based clustering for kernel logistic models (MIC) (Gomes et al., 2010)
with model selection by maximum-likelihood MI (Suzuki et al., 2008), and
the proposed SMIC.

Figure 1. Illustrative examples. Cluster assignments obtained by SMIC
(top) and model selection curves obtained by LSMI (bottom).

The clustering performance was evaluated by the adjusted Rand index (ARI)
(Hubert & Arabie, 1985) between inferred cluster assignments and the
ground truth categories. Larger ARI values mean better performance, and
ARI takes its maximum value 1 when two sets of cluster assignments are
identical. In addition, we also evaluated the computational efficiency of
each method by the CPU computation time.
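For reference, the adjusted Rand index can be computed from the contingency counts of the two labelings, as in the following sketch. The function name `adjusted_rand_index` is hypothetical, at least two samples are assumed, and returning 1 in the degenerate case where the denominator vanishes (for example, when both labelings put every sample in a single cluster) is a convention adopted here rather than something stated in the text.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same n >= 2 samples."""
    n = len(labels_a)
    cells = Counter(zip(labels_a, labels_b))           # contingency table entries
    sum_cells = sum(comb(v, 2) for v in cells.values())
    sum_a = sum(comb(v, 2) for v in Counter(labels_a).values())
    sum_b = sum(comb(v, 2) for v in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)              # chance-level agreement
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:                          # degenerate case (convention)
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, relabeled
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # maximally mismatched pairs
```

The first call yields 1 because ARI is invariant to how clusters are numbered, which is essential when comparing clustering output with ground-truth categories; the second yields a negative value, showing that ARI can fall below zero when agreement is worse than chance.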

We used various real-world datasets including images, natural languages,
accelerometric sensors, and speech: the USPS hand-written digit dataset
(‘digit’), the Olivetti Face dataset (‘face’), the 20-Newsgroups dataset
(‘document’), the SENSEVAL-2 dataset (‘word’), the ALKAN dataset
(‘accelerometry’), and the in-house speech dataset (‘speech’). Detailed
explanation of the datasets is omitted due to lack of space.

For each dataset, the experiment was repeated 100 times with random
choice of samples from a pool. Samples were centralized and their
variance was normalized in the dimension-wise manner, before feeding them
to clustering algorithms.

The experimental results are described in Table 1. For the digit dataset,
MIC and SMIC outperform KM, SC, and MNN in terms of ARI. The entire
computation time of SMIC including model selection is shorter than that
of KM, SC, and MIC, and is comparable to that of MNN, which does not
include a model selection procedure. For the face dataset, SC, MIC, and
SMIC are comparable to each other and are better than KM and MNN in terms
of ARI. For the document and word datasets, SMIC tends to outperform the
other methods. For the accelerometry dataset, MNN and SMIC work better
than the other methods. Finally, for the speech dataset, MIC and SMIC
work comparably well, and are significantly better than KM, SC, and MNN.

Table 1. Experimental results on real-world datasets (with equal cluster
size). The average clustering accuracy (and its standard deviation in the
bracket) in terms of ARI and the average CPU computation time in seconds
over 100 runs are described. The best method in terms of the average ARI
and methods judged to be comparable to the best one by the t-test at the
significance level 1% are described in boldface. Computation time of MIC
and SMIC corresponds to the time for computing a clustering solution
after model selection has been carried out. For reference, computation
time for the entire procedure including model selection is described in
the square bracket.

Digit (d = 256, n = 5000, and c = 10)
        KM           SC           MNN          MIC             SMIC
ARI     0.42(0.01)   0.24(0.02)   0.44(0.03)   0.63(0.08)      0.63(0.05)
Time    835.9        973.3        318.5        84.4[3631.7]    14.4[359.5]

Face (d = 4096, n = 100, and c = 10)
        KM           SC           MNN          MIC             SMIC
ARI     0.60(0.11)   0.62(0.11)   0.47(0.10)   0.64(0.12)      0.65(0.11)
Time    93.3         2.1          1.0          1.4[30.8]       0.0[19.3]

Document (d = 50, n = 700, and c = 7)
        KM           SC           MNN          MIC             SMIC
ARI     0.00(0.00)   0.09(0.02)   0.09(0.02)   0.01(0.02)      0.19(0.03)
Time    77.8         9.7          6.4          3.4[530.5]      0.3[115.3]

Word (d = 50, n = 300, and c = 3)
        KM           SC           MNN          MIC             SMIC
ARI     0.04(0.05)   0.02(0.01)   0.02(0.02)   0.04(0.04)      0.08(0.05)
Time    6.5          5.9          2.2          1.0[369.6]      0.2[203.9]

Accelerometry (d = 5, n = 300, and c = 3)
        KM           SC           MNN          MIC             SMIC
ARI     0.49(0.04)   0.58(0.14)   0.71(0.05)   0.57(0.23)      0.68(0.12)
Time    0.4          3.3          1.9          0.8[410.6]      0.2[92.6]

Speech (d = 50, n = 400, and c = 2)
        KM           SC           MNN          MIC             SMIC
ARI     0.00(0.00)   0.00(0.00)   0.04(0.15)   0.18(0.16)      0.21(0.25)
Time    0.9          4.2          1.8          0.7[413.4]      0.3[179.7]

Overall, MIC was shown to work reasonably well, implying that model
selection by maximum-likelihood MI is practically useful. SMIC was shown
to work even better than MIC, with much less computation time. The
accuracy improvement of SMIC over MIC was gained by computing the SMIC
solution in a closed form without any heuristic initialization. The
computational efficiency of SMIC was brought by the analytic computation
of the optimal solution and the class-wise optimization of LSMI (see
Section 2.4).

The performance of MNN and SC was rather unstable because of the
heuristic averaging of the number of nearest neighbors and the heuristic
choice of local scaling. In terms of computation time, they are
relatively efficient for small- to medium-sized datasets, but they are
expensive for the largest dataset, digit. KM was not reliable for the
document and speech datasets because of the restriction that the cluster
boundaries are linear. For the digit, face, and document datasets, KM was
computationally very expensive since a large number of iterations were
needed until convergence to a local optimum solution.

Table 2. Experimental results on real-world datasets under imbalanced
setup. ARI values are described in the table. Class-imbalance was
realized by setting the sample size of the first class m times larger
than other classes. The results for m = 1 are the same as the ones
reported in Table 1.

Digit (d = 256, n = 5000, and c = 10)
        KM           SC            MNN          MIC            SMIC
m = 1   0.42(0.01)   0.24(0.02)    0.44(0.03)   0.63(0.08)     0.63(0.05)
m = 2   0.52(0.01)   0.21(0.02)    0.43(0.04)   0.60(0.05)     0.63(0.05)

Document (d = 50, n = 700, and c = 7)
        KM           SC            MNN          MIC            SMIC
m = 1   0.00(0.00)   0.09(0.02)    0.09(0.02)   0.01(0.02)     0.19(0.03)
m = 2   0.01(0.01)   0.10(0.03)    0.10(0.02)   0.01(0.02)     0.19(0.04)
m = 3   0.01(0.01)   0.10(0.03)    0.09(0.02)   -0.01(0.03)    0.16(0.05)
m = 4   0.02(0.01)   0.09(0.03)    0.08(0.02)   -0.00(0.04)    0.14(0.05)

Word (d = 50, n = 300, and c = 3)
        KM           SC            MNN          MIC            SMIC
m = 1   0.04(0.05)   0.02(0.01)    0.02(0.02)   0.04(0.04)     0.08(0.05)
m = 2   0.00(0.07)   -0.01(0.01)   0.01(0.02)   -0.02(0.05)    0.03(0.05)

Accelerometry (d = 5, n = 300, and c = 3)
        KM           SC            MNN          MIC            SMIC
m = 1   0.49(0.04)   0.58(0.14)    0.71(0.05)   0.57(0.23)     0.68(0.12)
m = 2   0.48(0.05)   0.54(0.14)    0.58(0.11)   0.49(0.19)     0.69(0.16)
m = 3   0.49(0.05)   0.47(0.10)    0.42(0.12)   0.42(0.14)     0.66(0.20)
m = 4   0.49(0.06)   0.38(0.11)    0.31(0.09)   0.40(0.18)     0.56(0.22)

Finally, we performed similar experiments under an imbalanced setup,
where the sample size of the first class was set to be m times larger
than other classes. The results are summarized in Table 2, showing that
the performance of all methods tends to be degraded as the degree of
imbalance increases. Thus, clustering becomes more challenging if the
cluster size is imbalanced. Among the compared methods, the proposed SMIC
still worked better than other methods.

Overall,  the  proposed  SMIC  combined  with  LSMI  was 
shown  to  be  a  useful  alternative  to  existing  clustering 
approaches. 

5.  Conclusions 

In this paper, we proposed a novel information-maximization clustering
method, which learns class-posterior probabilities in an unsupervised
manner so that the squared-loss mutual information (SMI) between feature
vectors and cluster assignments is maximized. The proposed algorithm
called SMI-based clustering (SMIC) allows us to obtain clustering
solutions analytically by solving a kernel eigenvalue problem. Thus,
unlike the previous information-maximization clustering methods (Agakov &
Barber, 2006; Gomes et al., 2010), SMIC does not suffer from the problem
of local optima. Furthermore, we proposed to use an optimal
non-parametric SMI estimator called least-squares mutual information
(LSMI) for data-driven parameter optimization. Through experiments, SMIC
combined with LSMI was demonstrated to compare favorably with existing
clustering methods.
Abstract

In real-world classification problems, the class balance in the training
dataset does not necessarily reflect that of the test dataset, which can
cause significant estimation bias. If the class ratio of the test dataset
is known, instance re-weighting or resampling allows systematic bias
correction. However, learning the class ratio of the test dataset is
challenging when no labeled data is available from the test domain. In
this paper, we propose to estimate the class ratio in the test dataset by
matching probability distributions of training and test input data. We
demonstrate the utility of the proposed approach through experiments.


1.  Introduction 

Most supervised learning algorithms assume that training and test data
follow the same probability distribution (Vapnik, 1998; Hastie et al.,
2001; Bishop, 2006). However, this de facto standard assumption is often
violated in real-world problems, caused by intrinsic sample selection
bias or inevitable non-stationarity (Heckman, 1979; Quiñonero-Candela et
al., 2009; Sugiyama & Kawanabe, 2012).

In classification scenarios, changes in class balance are often observed.
For example, the male-female ratio is almost fifty-fifty in the real
world (test set), whereas training samples collected in a research
laboratory tend to be dominated by male data. Such a situation is called
a class-prior change, and the bias caused by differing class balances can
be systematically adjusted by instance re-weighting or resampling if the
class balance in the test dataset is known (Elkan, 2001; Lin et al.,
2002).
However, the class ratio in the test dataset is often unknown in
practice. A possible approach to coping with this problem is to learn a
classifier so that the performance for all possible class balances is
improved, e.g., through maximization of the area under the ROC curve
(Cortes & Mohri, 2004; Clémençon et al., 2009). Another, possibly more
direct approach is to estimate the class ratio in the test dataset and
use the estimates for instance re-weighting or resampling. In this paper,
we focus on the latter scenario under a semi-supervised learning setup
(Chapelle et al., 2006), where no labeled data is available from the test
domain.

Saerens et al. (2001) is a seminal paper on this topic, which proposed to
estimate the class ratio by the expectation-maximization (EM) algorithm
(Dempster et al., 1977), alternately updating the test class-prior and
class-posterior probabilities from some initial estimates until
convergence. This method has been successfully applied to various
real-world problems such as word sense disambiguation (Chan & Ng, 2006)
and remote sensing (Latinne et al., 2001).

In this paper, we first reformulate the above algorithm, and show that
this actually corresponds to approximating the test input distribution by
a linear combination of class-wise input distributions under the
Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951). In this
procedure, the class-wise input distributions are approximated via
class-posterior estimation, for example, by kernel logistic regression
(Hastie et al., 2001) or its squared-loss variant (Sugiyama, 2010).

This new formulation motivates us to develop a new approach,
since indirectly estimating the divergence by estimating
the individual class-posterior distributions may not
be the best scheme. Recently, KL divergence estimation
based on direct density-ratio estimation has been shown to
be promising (Nguyen et al., 2010; Sugiyama et al., 2008).
Furthermore, a squared-loss variant of the KL divergence
called the Pearson (PE) divergence (Pearson, 1900) can
also be approximated in the same way, with an analytic
solution that can be computed efficiently (Kanamori et al.,
2009a). The PE divergence and the KL divergence both belong
to the f-divergence class (Ali & Silvey, 1966; Csiszár,
1967), whose members share similar properties. In this paper, with
the aid of this density-ratio based PE divergence estimator,
we propose a new semi-supervised method for estimating
the class ratio in the test dataset. Through experiments, we
demonstrate the usefulness of the proposed method.

2.  Problem  Formulation  and  Existing  Method 

In  this  section,  we  formulate  the  problem  of  semi- 
supervised  class-prior  estimation  and  review  an  existing 
method  (Saerens  et  al.,  2001). 

2.1. Problem Formulation

Let $x \in \mathbb{R}^d$ be the $d$-dimensional input data, $y \in \{1, \ldots, c\}$
be the class label, and $c$ be the number of classes. We consider
class-prior change, i.e., the class-prior probability for training
data $p(y)$ and that for test data $p'(y)$ are different. However, we
assume that the class-conditional density for training data $p(x|y)$
and that for test data $p'(x|y)$ are the same:

$$p(x|y) = p'(x|y). \tag{1}$$

Note that training and test joint densities $p(x, y)$ and $p'(x, y)$
as well as training and test input densities $p(x)$ and $p'(x)$ are
generally different under this setup.

The goal of this paper is to estimate $p'(y)$ from labeled training
samples $\{(x_i, y_i)\}_{i=1}^{n}$ drawn independently from $p(x, y)$ and
unlabeled test samples $\{x'_i\}_{i=1}^{n'}$ drawn independently from
$p'(x)$. Given test labels $\{y'_i\}_{i=1}^{n'}$, $p'(y)$ can be naively
estimated by $n'_y/n'$, where $n'_y$ is the number of test samples in
class $y$. Here, however, we would like to estimate $p'(y)$ without
$\{y'_i\}_{i=1}^{n'}$.

2.2. Existing Method

We give a brief overview of an existing method for semi-supervised
class-prior estimation (Saerens et al., 2001), which is based on the
expectation-maximization (EM) algorithm (Dempster et al., 1977).

In the algorithm, test class-prior and class-posterior estimates
$\hat{p}'(y)$ and $\hat{p}'(y|x)$ are iteratively updated as follows:

1. Obtain an estimate of the training class-posterior probability,
$\hat{p}(y|x)$, from training data $\{(x_i, y_i)\}_{i=1}^{n}$, for example,
by kernel logistic regression (Hastie et al., 2001) or its
squared-loss variant (Sugiyama, 2010).

2. Obtain an estimate of the training class-prior probability,
$\hat{p}(y)$, from the labeled training data $\{(x_i, y_i)\}_{i=1}^{n}$ as
$\hat{p}(y) = n_y/n$, where $n_y$ is the number of training samples in
class $y$. Set the initial estimate of the test class-prior
probability equal to it: $\hat{p}'_0(y) = \hat{p}(y)$.

3. Repeat until convergence: $t = 1, 2, \ldots$

(a) Compute a new test class-posterior estimate $\hat{p}'_t(y|x)$ based
on the current test class-prior estimate $\hat{p}'_{t-1}(y)$ as

$$\hat{p}'_t(y|x) = \frac{\hat{p}'_{t-1}(y)\,\hat{p}(y|x)/\hat{p}(y)}{\sum_{y'=1}^{c} \hat{p}'_{t-1}(y')\,\hat{p}(y'|x)/\hat{p}(y')}. \tag{2}$$

(b) Compute a new test class-prior estimate $\hat{p}'_t(y)$ based on
the current test class-posterior estimate $\hat{p}'_t(y|x)$ as

$$\hat{p}'_t(y) = \frac{1}{n'} \sum_{i=1}^{n'} \hat{p}'_t(y|x'_i). \tag{3}$$

This procedure was shown to converge to a local optimal solution.

Note that Eq.(2) comes from the Bayes formulae,

$$p(x|y) = \frac{p(y|x)\,p(x)}{p(y)} \quad \text{and} \quad p'(x|y) = \frac{p'(y|x)\,p'(x)}{p'(y)},$$

combined with Eq.(1):

$$p'(y|x) \propto \frac{p'(y)}{p(y)}\, p(y|x).$$

Eq.(3) comes from empirical marginalization of

$$p'(y) = \int p'(y|x)\, p'(x)\, dx.$$

3. Reformulation of the EM Algorithm as Distribution Matching

In this section, we show that the above EM algorithm can be
interpreted as matching the test input density to a linear
combination of class-wise input distributions under the
Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951).

Based on the assumption that the class-conditional densities for
training and test data are unchanged (see Eq.(1)), let us model the
test input density $p'(x)$ by

$$q'(x) := \sum_{y=1}^{c} \theta_y\, p(x|y), \tag{4}$$

where $\theta_y$ is a coefficient corresponding to $p'(y)$:

$$\sum_{y=1}^{c} \theta_y = 1. \tag{5}$$


Semi-Supervised  Learning  of  Class  Balance  under  Class-Prior  Change  by  Distribution  Matching 


We match the model $q'(x)$ with the test input density $p'(x)$
under the KL divergence:

$$\mathrm{KL}(p'\|q') := \int p'(x) \log \frac{p'(x)}{q'(x)}\, dx = \int p'(x) \log p'(x)\, dx - \int p'(x) \log\Bigl(\sum_{y=1}^{c} \theta_y\, p(x|y)\Bigr) dx. \tag{6}$$

Ignoring the first term (which is a constant) and approximating
the expectation in the second term with its empirical average
give the following optimization problem:

$$\max_{\{\theta_y\}_{y=1}^{c}} \; \frac{1}{n'} \sum_{i=1}^{n'} \log\Bigl(\sum_{y=1}^{c} \theta_y\, p(x'_i|y)\Bigr) \tag{7}$$

subject to Eq.(5).

Since the above maximization is a convex optimization problem,
the Karush-Kuhn-Tucker (KKT) conditions are necessary and
sufficient for optimality (Boyd & Vandenberghe, 2004). The KKT
conditions for the above problem are given by Eq.(5) and

$$\frac{1}{n'} \sum_{i=1}^{n'} \frac{p(x'_i|y)}{\sum_{y'=1}^{c} \theta_{y'}\, p(x'_i|y')} = \nu, \quad \forall y = 1, \ldots, c,$$

where $\nu$ is a Lagrange multiplier. From these equations, we
can determine $\nu$ as

$$\nu = \sum_{y=1}^{c} \theta_y \nu = \sum_{y=1}^{c} \theta_y \Bigl(\frac{1}{n'} \sum_{i=1}^{n'} \frac{p(x'_i|y)}{\sum_{y'=1}^{c} \theta_{y'}\, p(x'_i|y')}\Bigr) = \frac{1}{n'} \sum_{i=1}^{n'} \frac{\sum_{y=1}^{c} \theta_y\, p(x'_i|y)}{\sum_{y'=1}^{c} \theta_{y'}\, p(x'_i|y')} = 1.$$

Then the solution $\{\theta_y\}_{y=1}^{c}$ can be calculated by fixed-point
iteration as follows (McLachlan & Krishnan, 1997):

$$\theta_y \longleftarrow \theta_y \Bigl(\frac{1}{n'} \sum_{i=1}^{n'} \frac{p(x'_i|y)}{\sum_{y'=1}^{c} \theta_{y'}\, p(x'_i|y')}\Bigr). \tag{8}$$

Making the substitution $p(x'_i|y) = p(y|x'_i)\,p(x'_i)/p(y)$,
canceling $p(x'_i)$ in the numerator and denominator, and replacing
$p(y|x)$ and $p(y)$ with their estimates $\hat{p}(y|x)$ and $\hat{p}(y)$,
we can show that the above updating formula is reduced to

$$\theta_y \longleftarrow \frac{1}{n'} \sum_{i=1}^{n'} \frac{\theta_y\, \hat{p}(y|x'_i)/\hat{p}(y)}{\sum_{y'=1}^{c} \theta_{y'}\, \hat{p}(y'|x'_i)/\hat{p}(y')},$$

which is the same as Eq.(3) with Eq.(2) substituted.

Therefore, the EM method is essentially equivalent to
matching the training and test input distributions under the
KL divergence, which uses the class-conditional density
$p(x|y)$ as a building block (see Eq.(8)). However, this fact
is not apparent in the EM expression because of the cancellation
of $p(x'_i)$ in the numerator and denominator.

The convexity of Eq.(7) implies that there are no spurious local
optima. However, this was not recognized in Saerens et al.
(2001) since the algorithm was derived via the incomplete
data EM method.

4. Class-Prior Estimation by Direct Divergence Minimization

The analysis in the previous section motivates us to explore
a more direct way to learn coefficients $\{\theta_y\}_{y=1}^{c}$. That is,
given an estimator of a divergence from $p'$ to $q'$, coefficients
$\{\theta_y\}_{y=1}^{c}$ are learned so that the divergence estimator
is minimized.

In this section, we first review a general framework of
approximating the f-divergences (Ali & Silvey, 1966; Csiszár, 1967)
via Legendre-Fenchel convex duality (Keziou, 2003; Nguyen et al.,
2010). Then we review two specific methods of divergence estimation
for the KL divergence and the Pearson (PE) divergence (Pearson, 1900).
Finally, we propose to use the PE divergence estimator for
determining the coefficients $\{\theta_y\}_{y=1}^{c}$.

4.1. Framework of f-Divergence Approximation

An f-divergence (Ali & Silvey, 1966; Csiszár, 1967) from
$p'$ to $q'$ is a general divergence measure defined by a convex
function $f$ such that $f(1) = 0$ as

$$D_f(p'\|q') := \int p'(x)\, f\Bigl(\frac{q'(x)}{p'(x)}\Bigr) dx.$$

It was shown that the f-divergence can be lower-bounded
via Legendre-Fenchel convex duality (Rockafellar, 1970) as
follows (Keziou, 2003; Nguyen et al., 2010):


$$D_f(p'\|q') = \max_{r} \Bigl[\int q'(x)\, r(x)\, dx - \int p'(x)\, f^*(r(x))\, dx\Bigr], \tag{9}$$

where $f^*$ is the convex conjugate of $f$. The maximum is
achieved if and only if $r(x) = q'(x)/p'(x)$. Eq.(9) is a
useful expression because the right-hand side only contains
expectations of $r$ and $f^*(r(x))$, which can be simply
approximated by sample averages.

Below, we show specific methods of divergence approximation
for the KL and PE divergences under model (4)
and the following parametric expression of the density ratio
$r(x)$:

$$r(x) = \sum_{\ell=0}^{b} \alpha_\ell\, \varphi_\ell(x), \tag{10}$$

where $\{\alpha_\ell\}_{\ell=0}^{b}$ are parameters and $\{\varphi_\ell(x)\}_{\ell=0}^{b}$ are basis
functions. In practice, we use a constant basis and Gaussian
kernels centered at the training data points, i.e., for $b = n$
and $\ell = 1, 2, \ldots, n$,

$$\varphi_0(x) = 1 \quad \text{and} \quad \varphi_\ell(x) = \exp\Bigl(-\frac{\|x - x_\ell\|^2}{2\sigma^2}\Bigr).$$

This provides a non-parametric divergence estimator
(Nguyen et al., 2010; Sugiyama et al., 2008;
Kanamori et al., 2012).

4.2. KL-Divergence Approximation

With $f(u) = -\log u$ for $u > 0$ and $+\infty$ for $u \le 0$, the
f-divergence is reduced to the KL divergence. For this $f$,
the convex conjugate is given by $f^*(v) = -1 - \log(-v)$
for $v < 0$ and $+\infty$ for $v \ge 0$. Then, if $-\alpha_\ell$ is regarded
as $\alpha_\ell$, an empirical approximation of Eq.(9) under (4) and
(10) is given as follows (Nguyen et al., 2010):

$$\mathrm{KL}(p'\|q') \approx \max_{\{\alpha_\ell\}_{\ell=0}^{b}} \Bigl[-\sum_{y=1}^{c} \frac{\theta_y}{n_y} \sum_{i: y_i = y} \sum_{\ell=0}^{b} \alpha_\ell\, \varphi_\ell(x_i) + \frac{1}{n'} \sum_{i=1}^{n'} \log\Bigl(\sum_{\ell=0}^{b} \alpha_\ell\, \varphi_\ell(x'_i)\Bigr) + 1\Bigr]$$

subject to $\alpha_0, \alpha_1, \ldots, \alpha_b \ge 0$. A similar approach, which
directly estimates the inverted ratio $p'(x)/q'(x)$ with the
same model (10), is also known (Sugiyama et al., 2008):

$$\mathrm{KL}(p'\|q') \approx \max_{\{\alpha_\ell\}_{\ell=0}^{b}} \frac{1}{n'} \sum_{i=1}^{n'} \log\Bigl(\sum_{\ell=0}^{b} \alpha_\ell\, \varphi_\ell(x'_i)\Bigr)$$

subject to $\alpha_0, \alpha_1, \ldots, \alpha_b \ge 0$ and

$$\sum_{y=1}^{c} \frac{\theta_y}{n_y} \sum_{i: y_i = y} \sum_{\ell=0}^{b} \alpha_\ell\, \varphi_\ell(x_i) = 1.$$

These are convex optimization problems, and thus global
optimal solutions can be obtained by simple optimization.
Tuning parameters possibly included in the basis function,
such as the kernel width, can be systematically optimized
by cross-validation (Sugiyama et al., 2008). The
KL-divergence estimator obtained above was proved to
possess superior convergence properties both in parametric
and non-parametric setups (Sugiyama et al., 2008;
Nguyen et al., 2010).

However, computing the KL-divergence estimator is rather
time-consuming because optimization of $\{\alpha_\ell\}_{\ell=0}^{b}$ needs to
be carried out for each $\{\theta_y\}_{y=1}^{c}$.

4.3. PE-Divergence Approximation

As an alternative to the KL divergence, let us consider the
PE divergence defined by

$$\mathrm{PE}(p'\|q') := \frac{1}{2} \int \Bigl(\frac{q'(x)}{p'(x)} - 1\Bigr)^2 p'(x)\, dx, \tag{11}$$

which is a squared-loss variant of the KL divergence and is
an f-divergence with $f(u) = (u - 1)^2/2$.

For this $f$, the convex conjugate is given by $f^*(v) = v^2/2 + v$.
Then, regarding $r(x) + 1$ as $r(x)$ so that model (10) represents
the ratio $q'(x)/p'(x)$, an empirical approximation of Eq.(9) under
(4) and (10) is given as follows (Kanamori et al., 2009a):

$$\mathrm{PE}(p'\|q') \approx \max_{\alpha} \Bigl[-\frac{1}{2} \alpha^\top \widehat{G} \alpha + \alpha^\top \widehat{H} \theta - \frac{1}{2}\Bigr],$$

where

$$\alpha = [\alpha_0\ \alpha_1\ \cdots\ \alpha_b]^\top, \quad \widehat{G} = \frac{1}{n'} \sum_{i=1}^{n'} \varphi(x'_i)\, \varphi(x'_i)^\top,$$

$$\varphi(x) = [\varphi_0(x)\ \varphi_1(x)\ \cdots\ \varphi_b(x)]^\top, \quad \widehat{H} = [\widehat{h}_1\ \cdots\ \widehat{h}_c],$$

$$\widehat{h}_y = \frac{1}{n_y} \sum_{i: y_i = y} \varphi(x_i), \quad \theta = [\theta_1\ \theta_2\ \cdots\ \theta_c]^\top.$$

A regularized solution to the above maximization problem
can be obtained analytically as

$$\widehat{\alpha} = (\widehat{G} + \lambda R)^{-1} \widehat{H} \theta, \tag{12}$$

where $\lambda$ is a positive constant and $R$ is defined as

$$R = \begin{bmatrix} 0 & 0_{1 \times b} \\ 0_{b \times 1} & I_{b \times b} \end{bmatrix}.$$

The PE divergence estimator obtained above was proved
to have superior convergence properties both in parametric
and non-parametric setups (Kanamori et al., 2009a; 2012).
Tuning parameters possibly included in the basis function,
such as the kernel width or the regularization parameter,
can be systematically optimized by cross-validation
(Kanamori et al., 2009a; 2012).
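
As a brief check of Eq.(12), the analytic solution can be read off from a stationarity condition. The sketch below assumes that the regularizer takes the form $\frac{\lambda}{2}\alpha^\top R \alpha$; this form is not stated explicitly in the text, but it is the one consistent with Eq.(12):

```latex
\begin{align*}
J(\alpha) &= -\tfrac{1}{2}\alpha^\top \widehat{G}\alpha
             + \alpha^\top \widehat{H}\theta - \tfrac{1}{2}
             - \tfrac{\lambda}{2}\alpha^\top R\alpha, \\
\nabla_\alpha J(\alpha) &= -\widehat{G}\alpha + \widehat{H}\theta - \lambda R\alpha = 0
\;\Longrightarrow\;
\widehat{\alpha} = (\widehat{G} + \lambda R)^{-1}\widehat{H}\theta .
\end{align*}
```

Because the top-left entry of $R$ is zero, the coefficient $\alpha_0$ of the constant basis function is left unpenalized under this assumed form.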


4.4.  Learning  Class  Ratios  by  PE  Divergence  Matching 

As shown above, the KL and PE divergences can be
systematically estimated without density estimation via
Legendre-Fenchel convex duality. Among them, the PE divergence
estimator, explicitly expressed as

$$\widehat{\mathrm{PE}}(\theta) := -\frac{1}{2} \theta^\top \widehat{H}^\top (\widehat{G} + \lambda R)^{-1} \widehat{G} (\widehat{G} + \lambda R)^{-1} \widehat{H} \theta + \theta^\top \widehat{H}^\top (\widehat{G} + \lambda R)^{-1} \widehat{H} \theta - \frac{1}{2},$$

is more useful for our purpose of learning class ratios,
for the following reasons. The PE divergence
was shown to be more robust against outliers than
the KL divergence, based on power divergence analysis
(Basu et al., 1998; Sugiyama et al., 2012). This is a useful
property in practical data analysis suffering high noise
and outliers. Furthermore, the above PE divergence estimator
was shown to possess the minimum condition number
among a general class of estimators, meaning that it is
the most stable estimator (Kanamori et al., 2009b).

Another, and practically more important advantage of the
above PE divergence estimator is that it can be computed
efficiently and analytically. This advantage is even more
crucial in our case because we minimize the above PE divergence
estimator with respect to $\theta$:

$$\min_{\theta} \widehat{\mathrm{PE}}(\theta) \quad \text{subject to} \quad \sum_{y=1}^{c} \theta_y = 1 \ \text{and} \ \theta_1, \ldots, \theta_c \ge 0.$$

Because $\widehat{\mathrm{PE}}(\theta)$ is given analytically as a function of $\theta$, we
can easily obtain the minimizer $\widehat{\theta}$ by simple optimization
strategies such as alternate gradient descent and projection
or just a grid search, without re-computing the PE divergence
estimator.
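
The procedure of Sections 4.3 and 4.4 can be sketched in a few lines of NumPy. This is only an illustrative sketch under several assumptions that are not from the text: the function names are hypothetical, class labels are coded as 0, ..., c-1, the kernel width `sigma` and the regularization parameter `lam` are fixed by hand rather than chosen by cross-validation, and the minimization is a plain grid search for the two-class case.

```python
import numpy as np

def gaussian_basis(x, centers, sigma):
    """Return rows [1, exp(-||x - c_l||^2 / (2 sigma^2))] for every row of x."""
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    kernels = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return np.hstack([np.ones((x.shape[0], 1)), kernels])

def pe_matrices(x_train, y_train, x_test, sigma, n_classes):
    """Build G-hat from test inputs and H-hat from class-wise training means."""
    phi_test = gaussian_basis(x_test, x_train, sigma)
    phi_train = gaussian_basis(x_train, x_train, sigma)
    G = phi_test.T @ phi_test / x_test.shape[0]
    H = np.stack(
        [phi_train[y_train == y].mean(axis=0) for y in range(n_classes)], axis=1
    )
    return G, H

def pe_hat(theta, G, H, lam):
    """Evaluate the PE divergence estimate using the analytic solution Eq.(12)."""
    R = np.eye(G.shape[0])
    R[0, 0] = 0.0  # the constant basis function is not regularized
    alpha = np.linalg.solve(G + lam * R, H @ theta)
    return -0.5 * alpha @ G @ alpha + alpha @ H @ theta - 0.5

def estimate_binary_prior(x_train, y_train, x_test, sigma=1.0, lam=0.1, n_grid=101):
    """Grid search over theta_1 in [0, 1] with theta_2 = 1 - theta_1."""
    G, H = pe_matrices(x_train, y_train, x_test, sigma, n_classes=2)
    grid = np.linspace(0.0, 1.0, n_grid)
    scores = [pe_hat(np.array([t, 1.0 - t]), G, H, lam) for t in grid]
    return grid[int(np.argmin(scores))]
```

Since the matrices G-hat and H-hat do not depend on the class ratio, they are built once and only the cheap quadratic form is re-evaluated for each grid point, which is the computational advantage described above.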

5.  Experiments 

In  this  section,  we  report  experimental  results. 

5.1.  Setup 

Table 1. Datasets used in the experiments.

Dataset      d    # samples   # positives   # negatives
Australian   14   690         307           383
Diabetes     8    768         500           268
German       24   1000        300           700
Ionosphere   34   351         225           126
SAHeart      9    462         302           160
Twonorm      20   7400        3697          3703

The following five methods are compared:

• EM-KLR: The method of Saerens et al. (2001) (see
Section 2.2). The class-posterior probability of the
training dataset is estimated using $\ell_2$-penalized kernel
logistic regression with Gaussian kernels. The
L-BFGS quasi-Newton implementation included in
the 'minFunc' package is used for logistic regression
training (Schmidt, 2005).

• KL-KDE: The KL divergence estimator based on kernel
density estimation (KDE). The class-wise input
densities are estimated by KDE with Gaussian kernels.
The kernel widths are estimated using likelihood
cross-validation (Silverman, 1986).

• PE-KDE: The PE divergence estimator based on
KDE. The class-wise input densities are estimated
by KDE with Gaussian kernels. The kernel widths
are estimated using least-squares cross-validation
(Silverman, 1986).

• KL-DR: The proposed method (see Section 4.2) using
a KL divergence estimator based on the density ratio
(DR). For the optimization, the L-BFGS with projection
implementation 'minFuncBC' is used (Schmidt,
2005).

• PE-DR: The proposed method (see Section 4.4) using
the PE divergence estimator based on DR.

Below,  we  compare  accuracy  of  class-prior  estimation  and 
classification. 
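
For reference, the EM iteration underlying the EM-KLR baseline (Eqs.(2) and (3) in Section 2.2) can be sketched as follows. The sketch assumes that the training class-posterior values on the test inputs have already been computed by some classifier; the function name, the array layout, and the stopping rule are illustrative choices, not details taken from the compared implementation.

```python
import numpy as np

def em_class_prior(posterior_test, train_prior, n_iter=1000, tol=1e-10):
    """Estimate test class priors from training posteriors on test inputs.

    posterior_test: array (n_test, c); row i holds the training class-posterior
        estimate for test input x'_i.
    train_prior: array (c,) of training class priors.
    """
    train_prior = np.asarray(train_prior, dtype=float)
    prior = train_prior.copy()  # initial test-prior estimate equals the training prior
    for _ in range(n_iter):
        # Eq.(2): re-weight the training posterior by the prior ratio, then normalize.
        weighted = posterior_test * (prior / train_prior)
        post = weighted / weighted.sum(axis=1, keepdims=True)
        # Eq.(3): average the test posteriors over the test samples.
        new_prior = post.mean(axis=0)
        if np.abs(new_prior - prior).max() < tol:
            return new_prior
        prior = new_prior
    return prior
```

In a toy case where the input takes two values with class-conditional probabilities (0.9, 0.2) for the first value and the test set mixes the classes with ratio 0.2 to 0.8, the iteration moves the estimate from the training prior toward the test ratio.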

5.2.  Benchmark  Datasets 

Here, we use binary-classification benchmark datasets
listed in Table 1. We select 10 samples from each of the
two classes for the training dataset and 50 samples for the
test dataset. The samples in the test set are selected with
probability $\theta^*$ from the first class and $(1 - \theta^*)$ from the
second class, where $\theta^* = 0.1, 0.2, 0.3, 0.4, 0.5$.

The average squared errors of the estimated class ratios are
given in Figure 1. This shows that methods based on the
KL and PE divergences overall outperform EM-KLR, implying
that our reformulation of the EM algorithm as distribution
matching (see Section 3) contributes to obtaining
accurate class-ratio estimates. Among the KL-based methods,
KL-KDE tends to perform better than KL-DR. This
is because, in KL-KDE, we did not estimate the first term
in Eq.(6), which is the negative entropy and is a constant.
On the other hand, the negative entropy is also implicitly
estimated in KL-DR, possibly incurring additional estimation
error. Among the PE-based methods, PE-DR outperforms
PE-KDE, showing that directly estimating density
ratios without density estimation is more promising as a
PE divergence estimator. Overall, PE-DR is shown to be
the most accurate.

Next, we compare classification accuracy when the learned
class-prior probabilities are used as instance weights. Figure 2
shows misclassification rates for a regularized least-squares
classifier (Rifkin et al., 2003) with instance weighting.
The results show that, as expected, a more accurate
estimate of the class ratio tends to give a lower misclassification
rate.
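
One way to turn estimated class priors into instance weights is sketched below. Under assumption (1), the ratio of the test and training joint densities reduces to $p'(y)/p(y)$, so each training sample can be weighted by the estimated prior ratio of its class. The linear feature model, the function names, and the closed-form solver are assumptions made for illustration; the text does not specify the classifier at this level of detail.

```python
import numpy as np

def class_prior_weights(y_train, train_prior, test_prior):
    """Weight each training sample by p'(y_i) / p(y_i); labels are 0, ..., c-1."""
    ratio = np.asarray(test_prior, dtype=float) / np.asarray(train_prior, dtype=float)
    return ratio[y_train]

def weighted_rls_fit(features, targets, weights, reg=1e-3):
    """Minimize sum_i w_i (f_i^T beta - t_i)^2 + reg * ||beta||^2 in closed form."""
    weighted_features = features * weights[:, None]
    gram = features.T @ weighted_features + reg * np.eye(features.shape[1])
    return np.linalg.solve(gram, weighted_features.T @ targets)
```

With targets coded as -1 and +1, the sign of `features @ beta` gives the predicted class in this sketch.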

Figure 1. Average squared error between the true class ratio $\theta^*$ and the estimated class ratio $\widehat{\theta}$ for the benchmark datasets listed in Table 1.
The best method and comparable methods according to the t-test at the significance level of 5% are indicated with a 'o'.

[Figure panels: Australian, Diabetes, German, Ionosphere, SAHeart, Twonorm; horizontal axis: true class ratio $\theta^*$.]

Figure 2. Average misclassification rates for the datasets listed in Table 1. Classification is performed using a regularized least-squares
classifier with instance weighting. The best method and comparable methods according to the t-test at the significance level of 5% are
indicated with a 'o'.


5.3.  Real-World  Application 

Finally, we demonstrate the usefulness of the proposed approach
in a real-world problem of military vehicle classification
from geophone recordings (Duarte & Hu, 2004).
This is a three-class problem: two vehicle classes and a
class of recorded noise. The features are 50-dimensional.
In this vehicle classification task, class-prior change is
inevitable because the type of vehicles passing through differs
depending on time (e.g., day and night).

For the training set, n samples are drawn from each of the labeled
classes with the uniform class prior, whereas 100
samples are drawn with probabilities p = [0.6 0.1 0.3] from
each of the classes for the test set. Due to the prohibitive
computational cost, KL-DR was not included in this experiment.

In Figure 3, we plot the $\ell_2$-distance between the true and estimated
class priors and the misclassification rate based on
instance-weighted kernel logistic regression (Hastie et al.,
2001) averaged over 1000 runs as functions of the number
of training samples. As can be seen from the graphs,
the performance of all methods improves as the number of
training samples increases. Among the compared methods,
PE-DR provides the most accurate estimates of the class
prior and thus yields the lowest classification error.

6.  Conclusion 

Class-prior change is a problem that is conceivable in many
real-world datasets, and it can be systematically corrected
for if the class-prior of the test dataset is known. In this
paper, we discussed the problem of estimating the test class
ratios under the semi-supervised learning setup.

We first showed that the EM-based estimator introduced in
Saerens et al. (2001) can be regarded as indirectly matching
the test input distribution with a linear combination
of class-wise input distributions. Based on this view,
we proposed to use an explicit and possibly more accurate
divergence estimator based on density-ratio estimation
(Kanamori et al., 2009a) for learning test class-priors. The
proposed method was shown to have various nice properties
such as high robustness to noise and outliers, superior
numerical stability, and excellent computational efficiency.
Through experiments, we showed that the class ratios estimated
by the proposed method are more accurate than those of
competing methods, which can be translated into better
classification accuracy.

Abstract

We address the problem of estimating the difference between two probability densities.
A naive approach is a two-step procedure of first estimating two densities
separately and then computing their difference. However, such a two-step procedure
does not necessarily work well because the first step is performed without regard
to the second step and thus a small estimation error incurred in the first stage
can cause a big error in the second stage. In this paper, we propose a single-shot
procedure for directly estimating the density difference without separately estimating
two densities. We derive a non-parametric finite-sample error bound for
the proposed single-shot density-difference estimator and show that it achieves the
optimal convergence rate. We then show how the proposed density-difference estimator
can be utilized in $L^2$-distance approximation. Finally, we experimentally
demonstrate the usefulness of the proposed method in robust distribution comparison
such as class-prior estimation and change-point detection.


1  Introduction 

When  estimating  a  quantity  consisting  of  two  elements,  a  two-stage  approach  of  first  estimating 
the  two  elements  separately  and  then  approximating  the  target  quantity  based  on  the  estimates  of 
the  two  elements  often  performs  poorly,  because  the  first  stage  is  carried  out  without  regard  to  the 
second  stage  and  thus  a  small  estimation  error  incurred  in  the  first  stage  can  cause  a  big  error  in  the 
second  stage.  To  cope  with  this  problem,  it  would  be  more  appropriate  to  directly  estimate  the  target 
quantity  in  a  single-shot  process  without  separately  estimating  the  two  elements. 

A seminal example that follows this general idea is pattern recognition by the support vector machine [1]:
Instead of separately estimating two probability distributions of patterns for positive and
negative classes, the support vector machine directly learns the boundary between the two classes
that is sufficient for pattern recognition. More recently, a problem of estimating the ratio of two
probability densities was tackled in a similar fashion [2, 3]: The ratio of two probability densities is
directly estimated without going through separate estimation of the two probability densities.

In  this  paper,  we  further  explore  this  line  of  research,  and  propose  a  method  for  directly  estimating 
the  difference  between  two  probability  densities  in  a  single-shot  process.  Density  differences  would 
be  more  desirable  than  density  ratios  because  density  ratios  can  diverge  to  infinity  even  under  a 
mild  condition  (e.g.,  two  Gaussians  [4]),  whereas  density  differences  are  always  finite  as  long  as 
each  density  is  bounded.  Density  differences  can  be  used  for  solving  various  machine  learning  tasks 
such  as  class-balance  estimation  under  class-prior  change  [5]  and  change-point  detection  in  time 
series  [6]. 

For this density-difference estimation problem, we propose a single-shot method, called the least-squares
density-difference (LSDD) estimator, that directly estimates the density difference without
separately estimating two densities. LSDD is derived within the framework of kernel regularized
least-squares estimation, and thus it inherits various useful properties: For example, the LSDD
solution can be computed analytically in a computationally efficient and stable manner, and all
tuning parameters such as the kernel width and the regularization parameter can be systematically
and objectively optimized via cross-validation. We derive a finite-sample error bound for the LSDD
estimator and show that it achieves the optimal convergence rate in a non-parametric setup.

We then apply LSDD to $L^2$-distance estimation and show that it is more accurate than the difference
of KDEs, which tends to severely under-estimate the $L^2$-distance [7]. Because the $L^2$-distance
is more robust against outliers than the Kullback-Leibler divergence [8], the proposed $L^2$-distance
estimator can lead to the paradigm of robust distribution comparison. We experimentally demonstrate
the usefulness of LSDD in semi-supervised class-prior estimation and unsupervised change
detection.


2  Density-Difference  Estimation 

In this section, we propose a single-shot method for estimating the difference between two probability
densities from samples, and analyze its theoretical properties.


Problem Formulation and Naive Approach: First, we formulate the problem of density-difference
estimation. Suppose that we are given two sets of independent and identically distributed
samples $\mathcal{X} := \{x_i\}_{i=1}^{n}$ and $\mathcal{X}' := \{x'_{i'}\}_{i'=1}^{n'}$ from probability distributions on $\mathbb{R}^d$ with densities
$p(x)$ and $p'(x)$, respectively. Our goal is to estimate the density difference,

$$f(x) := p(x) - p'(x),$$

from the samples $\mathcal{X}$ and $\mathcal{X}'$.

A naive approach to density-difference estimation is to use kernel density estimators (KDEs). However,
we argue that the KDE-based density-difference estimator is not the best approach because
of its two-step nature. Intuitively, good density estimators tend to be smooth and thus the difference
between such smooth density estimators tends to be over-smoothed as a density-difference
estimator [9]. To overcome this weakness, we give a single-shot procedure of directly estimating the
density difference $f(x)$ without separately estimating the densities $p(x)$ and $p'(x)$.


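Least-Squares Density-Difference Estimation: In our proposed approach, we fit a density-difference
model $g(x)$ to the true density-difference function $f(x)$ under the squared loss:

$$\mathop{\mathrm{argmin}}_{g} \int \bigl(g(x) - f(x)\bigr)^2 dx.$$

We use the following Gaussian kernel model as $g(x)$:

$$g(x) = \sum_{\ell=1}^{n+n'} \theta_\ell \exp\Bigl(-\frac{\|x - c_\ell\|^2}{2\sigma^2}\Bigr), \tag{1}$$

where $(c_1, \ldots, c_n, c_{n+1}, \ldots, c_{n+n'}) := (x_1, \ldots, x_n, x'_1, \ldots, x'_{n'})$ are Gaussian kernel centers. If
$n + n'$ is large, we may use only a subset of $\{x_1, \ldots, x_n, x'_1, \ldots, x'_{n'}\}$ as Gaussian kernel centers.

For the model (1), the optimal parameter $\theta^*$ is given by

$$\theta^* := \mathop{\mathrm{argmin}}_{\theta} \int \bigl(g(x) - f(x)\bigr)^2 dx = \mathop{\mathrm{argmin}}_{\theta} \bigl[\theta^\top H \theta - 2 h^\top \theta\bigr],$$

where $H$ is the $(n + n') \times (n + n')$ matrix and $h$ is the $(n + n')$-dimensional vector defined as

$$H_{\ell,\ell'} := \int \exp\Bigl(-\frac{\|x - c_\ell\|^2}{2\sigma^2}\Bigr) \exp\Bigl(-\frac{\|x - c_{\ell'}\|^2}{2\sigma^2}\Bigr) dx = (\pi\sigma^2)^{d/2} \exp\Bigl(-\frac{\|c_\ell - c_{\ell'}\|^2}{4\sigma^2}\Bigr),$$

$$h_\ell := \int \exp\Bigl(-\frac{\|x - c_\ell\|^2}{2\sigma^2}\Bigr) p(x)\, dx - \int \exp\Bigl(-\frac{\|x - c_\ell\|^2}{2\sigma^2}\Bigr) p'(x)\, dx.$$

Replacing the expectations in $h$ by empirical estimators and adding an $\ell_2$-regularizer to the objective
function, we arrive at the following optimization problem:

$$\widehat{\theta} := \mathop{\mathrm{argmin}}_{\theta} \bigl[\theta^\top H \theta - 2 \widehat{h}^\top \theta + \lambda\, \theta^\top \theta\bigr], \tag{2}$$

where $\lambda$ $(> 0)$ is the regularization parameter and $\widehat{h}$ is the $(n + n')$-dimensional vector defined as

$$\widehat{h}_\ell := \frac{1}{n} \sum_{i=1}^{n} \exp\Bigl(-\frac{\|x_i - c_\ell\|^2}{2\sigma^2}\Bigr) - \frac{1}{n'} \sum_{i'=1}^{n'} \exp\Bigl(-\frac{\|x'_{i'} - c_\ell\|^2}{2\sigma^2}\Bigr).$$

Taking the derivative of the objective function in Eq.(2) and equating it to zero, we can obtain the
solution analytically as

$$\widehat{\theta} = (H + \lambda I)^{-1} \widehat{h},$$

where $I$ denotes the identity matrix.

Finally, a density-difference estimator $\widehat{f}(x)$, which we call the least-squares density-difference
(LSDD) estimator, is given as

$$\widehat{f}(x) = \sum_{\ell=1}^{n+n'} \widehat{\theta}_\ell \exp\Bigl(-\frac{\|x - c_\ell\|^2}{2\sigma^2}\Bigr).$$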

Non-Parametric Error Bound: Here, we theoretically analyze an estimation error of LSDD.

We assume $n' = n$, and let $\mathcal{H}_\gamma$ be the reproducing kernel Hilbert space (RKHS) corresponding to
the Gaussian kernel with width $\gamma$: $k_\gamma(x, x') = \exp(-\|x - x'\|^2/\gamma^2)$. Let us consider a slightly
modified LSDD estimator that is more suitable for non-parametric error analysis¹:

$$\widehat{f} := \mathop{\mathrm{argmin}}_{g \in \mathcal{H}_\gamma} \Bigl[\|g\|_{L_2(\mathbb{R}^d)}^2 - 2\Bigl(\frac{1}{n} \sum_{i=1}^{n} g(x_i) - \frac{1}{n} \sum_{i'=1}^{n} g(x'_{i'})\Bigr) + \lambda \|g\|_{\mathcal{H}_\gamma}^2\Bigr].$$

Then we have the following theorem:

Theorem 1. Suppose that there exists a constant $M$ such that $\|p\|_\infty \le M$ and $\|p'\|_\infty \le M$.
Suppose also that the density difference $f = p - p'$ is a member of Besov space with regularity $\alpha$.
That is, $f \in B_{2,\infty}^{\alpha}$ where $B_{2,\infty}^{\alpha}$ is the Besov space with regularity $\alpha$, and

$$\|f\|_{B_{2,\infty}^{\alpha}} := \|f\|_{L_2(\mathbb{R}^d)} + \sup_{t > 0}\bigl(t^{-\alpha} \omega_{r, L_2(\mathbb{R}^d)}(f, t)\bigr) < c \quad \text{for } r = \lfloor \alpha \rfloor + 1,$$

where $\lfloor \alpha \rfloor$ denotes the largest integer less than or equal to $\alpha$ and $\omega_{r, L_2(\mathbb{R}^d)}$ is the $r$-th modulus of
smoothness (see [10] for the definitions). Then, for all $\epsilon > 0$ and $p \in (0, 1)$, there exists a constant
$K > 0$ depending on $M$, $c$, $\epsilon$, and $p$ such that for all $n \ge 1$, $\tau \ge 1$, and $\lambda > 0$, the LSDD estimator
$\widehat{f}$ in $\mathcal{H}_\gamma$ satisfies

$$\|\widehat{f} - f\|_{L_2(\mathbb{R}^d)}^2 + \lambda \|\widehat{f}\|_{\mathcal{H}_\gamma}^2 \le K\Bigl(\lambda \gamma^{-d} + \gamma^{2\alpha} + \cdots\Bigr)$$

with probability not less than $1 - 4e^{-\tau}$.

If we set $\lambda = n^{-\frac{2\alpha + d}{(2\alpha + d)(1 + p) + (\epsilon - p + \epsilon p)}}$ and $\gamma = n^{-\frac{1}{(2\alpha + d)(1 + p) + (\epsilon - p + \epsilon p)}}$, and take $\epsilon$ and $p$ sufficiently
small, then we immediately have the following corollary.

Corollary 1. Suppose that the same assumptions as Theorem 1 hold. Then, for all $\rho, \rho' > 0$, there
exists a constant $K > 0$ depending on $M$, $c$, $\rho$, and $\rho'$ such that, for all $n \ge 1$ and $\tau \ge 1$, the
density-difference estimator $\widehat{f}$ with appropriate choice of $\gamma$ and $\lambda$ satisfies

$$\|\widehat{f} - f\|_{L_2(\mathbb{R}^d)}^2 + \lambda \|\widehat{f}\|_{\mathcal{H}_\gamma}^2 \le K\Bigl(n^{-\frac{2\alpha}{2\alpha + d} + \rho} + \tau n^{-1 + \rho'}\Bigr)$$

with probability not less than $1 - 4e^{-\tau}$.

¹ More specifically, the regularizer is replaced from the squared $\ell_2$-norm of parameters to the squared RKHS-norm
of a learned function, which is necessary to establish consistency. Nevertheless, we use the squared
$\ell_2$-norm of parameters in experiments because it is simpler and seems to perform well in practice.
Note that $n^{-2\alpha/(2\alpha+d)}$ is the optimal learning rate to estimate a function in $B^{\alpha}_{2,\infty}$. Therefore, the
density-difference estimator with a Gaussian kernel achieves the optimal learning rate by appropriately
choosing the regularization parameter and the Gaussian width. Because the learning rate depends
on $\alpha$, the LSDD estimator has adaptivity to the smoothness of the true function.

It is known that, if the naive KDE with a Gaussian kernel is used for estimating a probability density
with regularity $\alpha > 2$, the optimal learning rate cannot be achieved [11, 12]. To achieve the optimal
rate by KDE, we should choose a kernel function specifically tailored to each regularity $\alpha$ [13].
However, such a kernel function is not non-negative and is difficult to implement in practice.
On the other hand, our LSDD estimator with a Gaussian kernel can always achieve the optimal learning
rate regardless of the regularity $\alpha$.


Model Selection by Cross-Validation: The above theoretical analysis showed the superiority of
LSDD. However, in practice, the performance of LSDD depends on the choice of models (i.e.,
the kernel width $\sigma$ and the regularization parameter $\lambda$). Here, we show that the model can be
optimized by cross-validation (CV). More specifically, we first divide the samples $X = \{x_i\}_{i=1}^{n}$
and $X' = \{x'_{i'}\}_{i'=1}^{n'}$ into $T$ disjoint subsets $\{X_t\}_{t=1}^{T}$ and $\{X'_t\}_{t=1}^{T}$, respectively. Then we obtain a
density-difference estimate $\hat{f}_t(x)$ from $X \setminus X_t$ and $X' \setminus X'_t$ (i.e., all samples without $X_t$ and $X'_t$), and
compute its hold-out error for $X_t$ and $X'_t$ as

$$\mathrm{CV}^{(t)} := \int \hat{f}_t(x)^2\,dx - \frac{2}{|X_t|}\sum_{x \in X_t} \hat{f}_t(x) + \frac{2}{|X'_t|}\sum_{x' \in X'_t} \hat{f}_t(x'),$$

where $|X|$ denotes the number of elements in the set $X$. We repeat this hold-out validation procedure
for $t = 1, \ldots, T$, and compute the average hold-out error. Finally, we choose the model that
minimizes the average hold-out error.
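The cross-validation procedure above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' MATLAB implementation: it assumes the Gaussian kernel model $g(x) = \sum_\ell \theta_\ell \exp(-\|x - c_\ell\|^2/(2\sigma^2))$, for which $\int g(x)^2 dx = \theta^\top H \theta$ with $H_{\ell\ell'} = (\pi\sigma^2)^{d/2}\exp(-\|c_\ell - c_{\ell'}\|^2/(4\sigma^2))$. The function names, the fixed `centers` argument, and the candidate grids are assumptions made for the example.

```python
import numpy as np

def gauss_basis(X, centers, sigma):
    """Gaussian basis values exp(-||x - c||^2 / (2 sigma^2)); shape (len(X), len(centers))."""
    sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma**2))

def lsdd_fit(X, Xp, centers, sigma, lam):
    """Closed-form coefficients theta = (H + lam I)^{-1} h for the assumed Gaussian model."""
    d = centers.shape[1]
    sq_cc = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # H holds the integrals of products of two Gaussian basis functions.
    H = (np.pi * sigma**2) ** (d / 2.0) * np.exp(-sq_cc / (4.0 * sigma**2))
    h = gauss_basis(X, centers, sigma).mean(axis=0) - gauss_basis(Xp, centers, sigma).mean(axis=0)
    theta = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    return theta, H

def lsdd_cv_error(X, Xp, centers, sigma, lam, n_folds=5, seed=0):
    """Average of the hold-out errors CV^(t) over n_folds disjoint folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    folds_p = np.array_split(rng.permutation(len(Xp)), n_folds)
    errors = []
    for idx, idx_p in zip(folds, folds_p):
        keep = np.setdiff1d(np.arange(len(X)), idx)
        keep_p = np.setdiff1d(np.arange(len(Xp)), idx_p)
        theta, H = lsdd_fit(X[keep], Xp[keep_p], centers, sigma, lam)
        f_out = gauss_basis(X[idx], centers, sigma) @ theta        # f_t on held-out X_t
        f_out_p = gauss_basis(Xp[idx_p], centers, sigma) @ theta   # f_t on held-out X'_t
        errors.append(theta @ H @ theta - 2.0 * f_out.mean() + 2.0 * f_out_p.mean())
    return float(np.mean(errors))

def select_model(X, Xp, centers, sigmas, lams):
    """Grid search: return the (sigma, lam) pair with the smallest average hold-out error."""
    grid = [(s, l) for s in sigmas for l in lams]
    return min(grid, key=lambda sl: lsdd_cv_error(X, Xp, centers, sl[0], sl[1]))
```

Calling `select_model(X, Xp, centers, sigmas, lams)` with a handful of candidate widths and regularization values returns the pair with the smallest average hold-out error.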


3  $L^2$-Distance Estimation by LSDD

In this section, we consider the problem of approximating the $L^2$-distance between $p(x)$ and $p'(x)$,

$$L^2(p, p') := \int \bigl(p(x) - p'(x)\bigr)^2 dx,$$

from their independent and identically distributed samples $X := \{x_i\}_{i=1}^{n}$ and $X' := \{x'_{i'}\}_{i'=1}^{n'}$.
For an equivalent expression $L^2(p, p') = \int f(x)p(x)\,dx - \int f(x')p'(x')\,dx'$, if we replace $f(x)$
with an LSDD estimator $\hat{f}(x)$ and approximate the expectations by empirical averages, we obtain
$L^2(p, p') \approx \hat{h}^\top\hat{\theta}$. Similarly, for another expression $L^2(p, p') = \int f(x)^2 dx$, replacing $f(x)$ with
an LSDD estimator $\hat{f}(x)$ gives $L^2(p, p') \approx \hat{\theta}^\top H \hat{\theta}$.

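Although $\hat{h}^\top\hat{\theta}$ and $\hat{\theta}^\top H \hat{\theta}$ themselves give approximations to $L^2(p, p')$, we argue that the use of
their combination, defined by

$$\hat{L}^2(X, X') := 2\hat{h}^\top\hat{\theta} - \hat{\theta}^\top H \hat{\theta}, \qquad (3)$$

is more sensible. To explain the reason, let us consider a generalized $L^2$-distance estimator of the
form $\beta\hat{h}^\top\hat{\theta} + (1 - \beta)\hat{\theta}^\top H \hat{\theta}$, where $\beta$ is a real scalar. If the regularization parameter $\lambda$ ($\ge 0$) is
small, this can be expressed as

$$\beta\hat{h}^\top\hat{\theta} + (1 - \beta)\hat{\theta}^\top H \hat{\theta} = \hat{h}^\top H^{-1}\hat{h} - \lambda(2 - \beta)\hat{h}^\top H^{-2}\hat{h} + o_p(\lambda), \qquad (4)$$

where $o_p$ denotes the probabilistic order. Thus, up to $O_p(\lambda)$, the bias introduced by regularization
(i.e., the second term in the right-hand side of Eq.(4) that depends on $\lambda$) can be eliminated if $\beta = 2$,
which yields Eq.(3). Note that, if no regularization is imposed (i.e., $\lambda = 0$), both $\hat{h}^\top\hat{\theta}$ and $\hat{\theta}^\top H \hat{\theta}$
yield $\hat{h}^\top H^{-1}\hat{h}$, the first term in the right-hand side of Eq.(4).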
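For readers who want to see where the factor $(2 - \beta)$ in Eq.(4) comes from, the following first-order expansion is a short sketch. It assumes that $H$ is invertible and that $\hat{\theta} = (H + \lambda I)^{-1}\hat{h}$ is the regularized LSDD solution; the hats follow the notation used for the estimators in this section.

```latex
\begin{align*}
\hat{\theta} &= (H + \lambda I)^{-1}\hat{h}
   = H^{-1}\hat{h} - \lambda H^{-2}\hat{h} + o_p(\lambda), \\
\hat{h}^\top\hat{\theta}
   &= \hat{h}^\top H^{-1}\hat{h} - \lambda\,\hat{h}^\top H^{-2}\hat{h} + o_p(\lambda), \\
\hat{\theta}^\top H \hat{\theta}
   &= \hat{h}^\top H^{-1}\hat{h} - 2\lambda\,\hat{h}^\top H^{-2}\hat{h} + o_p(\lambda), \\
\beta\,\hat{h}^\top\hat{\theta} + (1-\beta)\,\hat{\theta}^\top H \hat{\theta}
   &= \hat{h}^\top H^{-1}\hat{h}
      - \lambda\,\bigl(\beta + 2(1-\beta)\bigr)\,\hat{h}^\top H^{-2}\hat{h} + o_p(\lambda).
\end{align*}
```

Since $\beta + 2(1 - \beta) = 2 - \beta$, the first-order term in $\lambda$ vanishes exactly when $\beta = 2$.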




Eq.(3) is actually equivalent to the negative of the optimal objective value of the LSDD optimization
problem without regularization (i.e., Eq.(2) with $\lambda = 0$). This can be naturally interpreted through a
lower bound of $L^2(p, p')$ obtained by Legendre-Fenchel convex duality [14]:

$$L^2(p, p') = \sup_{g} \Bigl[ 2\Bigl( \int g(x)p(x)\,dx - \int g(x')p'(x')\,dx' \Bigr) - \int g(x)^2 dx \Bigr],$$

where the supremum is attained at $g = f$. If the expectations are replaced by empirical estimators
and the Gaussian kernel model (1) is used as $g$, the above optimization problem is reduced
to the LSDD objective function without regularization (see Eq.(2)). Thus, LSDD corresponds to
approximately maximizing the above lower bound and Eq.(3) is its maximum value.
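The claim that the supremum is attained at $g = f$ can be checked by completing the square. The following is a short worked step, using only $f = p - p'$ so that $\int g(x)p(x)\,dx - \int g(x')p'(x')\,dx' = \int g(x)f(x)\,dx$:

```latex
\begin{align*}
2\int g(x) f(x)\,dx - \int g(x)^2\,dx
  &= \int f(x)^2\,dx - \int \bigl(g(x) - f(x)\bigr)^2 dx \\
  &\le \int f(x)^2\,dx = L^2(p, p'),
\end{align*}
```

with equality if and only if $g = f$ almost everywhere.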

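Through eigenvalue decomposition of $H$, we can show that $2\hat{h}^\top\hat{\theta} - \hat{\theta}^\top H \hat{\theta} \ge \hat{h}^\top\hat{\theta} \ge \hat{\theta}^\top H \hat{\theta}$.
Thus, our approximator (3) is not less than the plain approximators $\hat{h}^\top\hat{\theta}$ and $\hat{\theta}^\top H \hat{\theta}$.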
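The ordering of the three approximators can also be checked numerically. The sketch below is illustrative only: it uses a random symmetric positive semi-definite matrix as a stand-in for $H$ and a random vector as a stand-in for $\hat{h}$ (both are assumptions, not LSDD quantities computed from data). It also relies on the identity $\hat{h} = (H + \lambda I)\hat{\theta}$, which gives $\hat{h}^\top\hat{\theta} - \hat{\theta}^\top H \hat{\theta} = \lambda\|\hat{\theta}\|^2 \ge 0$; this is a different route from the eigenvalue-decomposition argument mentioned in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
b = 20                                  # number of basis functions (arbitrary)
A = rng.standard_normal((b, b))
H = A @ A.T / b                         # symmetric PSD stand-in for H
h = rng.standard_normal(b)              # stand-in for h-hat
lam = 0.1                               # regularization parameter

theta = np.linalg.solve(H + lam * np.eye(b), h)

h_theta = h @ theta                     # plain approximator h^T theta
theta_H_theta = theta @ H @ theta       # plain approximator theta^T H theta
combined = 2.0 * h_theta - theta_H_theta  # combined approximator, Eq.(3)
gap = lam * (theta @ theta)             # equals h^T theta - theta^T H theta

print(combined >= h_theta >= theta_H_theta)
```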


4  Experiments 

In this section, we experimentally demonstrate the usefulness of LSDD. A MATLAB® implementation
of LSDD used for experiments is available from

"http://sugiyama-www.cs.titech.ac.jp/~sugi/software/LSDD/".

Illustration: Let $N(x; \mu, \Sigma)$ be the multi-dimensional normal density with mean vector $\mu$ and
variance-covariance matrix $\Sigma$ with respect to $x$, and let

$$p(x) = N\bigl(x; (\mu, 0, \ldots, 0)^\top, (4\pi)^{-1} I_d\bigr) \quad\text{and}\quad p'(x) = N\bigl(x; (0, 0, \ldots, 0)^\top, (4\pi)^{-1} I_d\bigr).$$

We  first  illustrate  how  LSDD  behaves  under  d  =  1  and  n  =  n'  =  200.  We  compare  LSDD  with 
KDEi  (KDE  with  two  Gaussian  widths  chosen  independently  by  least-squares  cross-validation  [15]) 
and  KDEj  (KDE  with  two  Gaussian  widths  chosen  jointly  to  minimize  the  LSDD  criterion  [9]).  The 
number  of  folds  in  cross-validation  is  set  to  5  for  all  methods. 
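For this pair of Gaussians the true $L^2$-distance has a simple closed form, which is convenient when judging the estimates. The expression below is derived here from standard Gaussian integrals and is not stated in the text: with covariance $(4\pi)^{-1} I_d$, $\int p(x)^2 dx = \int p'(x)^2 dx = 1$ and $\int p(x)p'(x)\,dx = \exp(-\pi\mu^2)$ for any $d$, so $L^2(p, p') = 2 - 2\exp(-\pi\mu^2)$. The sketch checks this against a numerical integral for $d = 1$; the function names and the integration grid are assumptions made for the example.

```python
import numpy as np

def true_l2_distance(mu):
    """Closed-form L2-distance between N(mu e_1, I/(4 pi)) and N(0, I/(4 pi))."""
    return 2.0 - 2.0 * np.exp(-np.pi * mu**2)

def normal_pdf_1d(x, mean, var):
    """One-dimensional normal density."""
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def numeric_l2_distance_1d(mu, lo=-5.0, hi=5.0, m=200001):
    """Riemann-sum approximation of the integral of (p - p')^2 for d = 1."""
    x = np.linspace(lo, hi, m)
    var = 1.0 / (4.0 * np.pi)
    diff = normal_pdf_1d(x, mu, var) - normal_pdf_1d(x, 0.0, var)
    dx = x[1] - x[0]
    return float(np.sum(diff**2) * dx)

for mu in (0.0, 0.2, 0.4, 0.6, 0.8):
    print(mu, true_l2_distance(mu), numeric_l2_distance_1d(mu))
```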

Figure 1 depicts density-difference estimation results obtained by LSDD, KDEi, and KDEj for $\mu = 0$
(i.e., $f(x) = p(x) - p'(x) = 0$). The figure shows that LSDD and KDEj give accurate estimates
of the density difference $f(x) = 0$. On the other hand, the estimate obtained by KDEi fluctuates
considerably, although both densities are reasonably well approximated by KDEs. This illustrates an
advantage of directly estimating the density difference without going through separate estimation of
each density. Figure 2 depicts the results for $\mu = 0.5$ (i.e., $f(x) \neq 0$), showing again that LSDD
performs well. KDEi and KDEj give the same estimation result for this dataset, which slightly
underestimates the peaks.

Next, we compare the performance of $L^2$-distance approximation based on LSDD, KDEi, and KDEj.
For $\mu = 0, 0.2, 0.4, 0.6, 0.8$ and $d = 1, 5$, we draw $n = n' = 200$ samples from the above $p(x)$
and $p'(x)$. Figure 3 depicts the mean and standard error of estimated $L^2$-distances over 1000 runs
as functions of the mean $\mu$. When $d = 1$ (Figure 3(a)), the LSDD-based $L^2$-distance estimator gives
the most accurate estimates of the true $L^2$-distance, whereas the KDEi-based $L^2$-distance estimator
slightly underestimates the true $L^2$-distance when $\mu$ is large. This is caused by the fact that KDE
tends to provide smooth density estimates (see Figure 2(b) again): such smooth density estimates
are accurate as density estimates, but the difference of smooth density estimates yields a small
$L^2$-distance estimate [7]. The KDEj-based $L^2$-distance estimator tends to mitigate this drawback of
KDEi, but it still slightly underestimates the true $L^2$-distance when $\mu$ is large.

When $d = 5$ (Figure 3(b)), the KDE-based $L^2$-distance estimators severely underestimate
the true $L^2$-distance when $\mu$ is large. On the other hand, the LSDD-based $L^2$-distance estimator
still gives reasonably accurate estimates of the true $L^2$-distance even when $d = 5$. However, we
note that LSDD also slightly underestimates the true $L^2$-distance when $\mu$ is large, because slight
underestimation tends to yield smaller variance and thus such stabilized solutions are more accurate
in terms of the bias-variance trade-off.

(a) LSDD  (b) KDEi  (c) KDEj

Figure 1: Estimation of density difference when $\mu = 0$ (i.e., $f(x) = p(x) - p'(x) = 0$).

(a) LSDD  (b) KDEi  (c) KDEj

Figure 2: Estimation of density difference when $\mu = 0.5$ (i.e., $f(x) = p(x) - p'(x) \neq 0$).

(a) $d = 1$  (b) $d = 5$

Figure 3: $L^2$-distance estimation by LSDD, KDEi, and KDEj for $n = n' = 200$ as functions of the
Gaussian mean $\mu$. Means and standard errors over 1000 runs are plotted.

Semi-Supervised Class-Balance Estimation: In real-world pattern recognition tasks, changes in
class balance between the training and test phases are often observed. In such cases, naive classifier
training produces significant estimation bias because the class balance in the training dataset does
not properly reflect that of the test dataset.

Here, we consider a binary pattern recognition task of classifying pattern $x \in \mathbb{R}^d$ to class $y \in
\{+1, -1\}$. Our goal is to learn the class balance of a test dataset in a semi-supervised learning setup
where unlabeled test samples are provided in addition to labeled training samples [16]. The class
balance in the test set can be estimated by matching a mixture of class-wise training input densities,

$$q_{\mathrm{test}}(x; \pi) := \pi\, p_{\mathrm{train}}(x \mid y = +1) + (1 - \pi)\, p_{\mathrm{train}}(x \mid y = -1),$$

to the test input density $p_{\mathrm{test}}(x)$ [5], where $\pi \in [0, 1]$ is a mixing coefficient to learn. See Figure 4
for schematic illustration. Here, we use the $L^2$-distance estimated by LSDD and the difference of
KDEs for this distribution matching. Note that, when LSDD is used to estimate the $L^2$-distance,
separate estimation of $p_{\mathrm{train}}(x \mid y = \pm 1)$ is not involved, but the difference between $p_{\mathrm{test}}(x)$ and
$q_{\mathrm{test}}(x; \pi)$ is directly estimated.
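The distribution-matching idea can be sketched as a grid search over $\pi$. This is a simplified illustration under several assumptions that are not taken from the text: the Gaussian-kernel LSDD model with closed-form $H$, kernel centers placed at the test points, a fixed width `sigma` and regularization `lam` (rather than cross-validated values), and a uniform grid of candidate $\pi$ values. The empirical vector $\hat{h}$ of the mixture is the $\pi$-weighted combination of the class-wise basis means minus the test basis mean, so the class-wise densities are never estimated separately.

```python
import numpy as np

def gauss_basis(X, centers, sigma):
    """Gaussian basis values exp(-||x - c||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma**2))

def estimate_class_balance(X_pos, X_neg, X_test, sigma=1.0, lam=1e-2, grid=None):
    """Return the pi on the grid that minimizes the LSDD-based L2-distance estimate."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 21)
    centers = X_test                      # assumption: test points as kernel centers
    d = centers.shape[1]
    sq_cc = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    H = (np.pi * sigma**2) ** (d / 2.0) * np.exp(-sq_cc / (4.0 * sigma**2))
    A = H + lam * np.eye(len(centers))
    m_pos = gauss_basis(X_pos, centers, sigma).mean(axis=0)
    m_neg = gauss_basis(X_neg, centers, sigma).mean(axis=0)
    m_test = gauss_basis(X_test, centers, sigma).mean(axis=0)
    distances = []
    for pi in grid:
        # h-hat for the difference q_test(.; pi) - p_test
        h = pi * m_pos + (1.0 - pi) * m_neg - m_test
        theta = np.linalg.solve(A, h)
        distances.append(2.0 * h @ theta - theta @ H @ theta)   # Eq.(3)
    distances = np.array(distances)
    return float(grid[np.argmin(distances)]), distances
```

A finer grid or a one-dimensional optimizer could replace the grid search; the grid is used here only to keep the sketch short.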

We use four UCI benchmark datasets (http://archive.ics.uci.edu/ml/), where we randomly
choose 10 labeled training samples from each class and 50 unlabeled test samples following
true class-prior $\pi^* = 0.1, 0.2, \ldots, 0.9$. Figure 6 plots the mean and standard error of the squared
difference between true and estimated class-balances $\pi$ and the misclassification error by a weighted
$\ell_2$-regularized least-squares classifier [17] with weighted cross-validation [18] over 1000 runs. The
results show that LSDD tends to provide better class-balance estimates than the KDEi-based, the
KDEj-based, and the EM-based methods [5], which translates into lower classification errors.
Unsupervised Change Detection: The objective of change detection is to discover abrupt property
changes behind time-series data. Let $y(t) \in \mathbb{R}^m$ be an $m$-dimensional time-series sample at
time $t$, and let $Y(t) := [y(t)^\top, y(t+1)^\top, \ldots, y(t+k-1)^\top]^\top \in \mathbb{R}^{km}$ be a subsequence of time
series at time $t$ with length $k$. We treat the subsequence $Y(t)$ as a sample, instead of a single point
$y(t)$, by which time-dependent information can be incorporated naturally [6]. Let $\mathcal{Y}(t)$ be a set of $r$
retrospective subsequence samples starting at time $t$: $\mathcal{Y}(t) := \{Y(t), Y(t+1), \ldots, Y(t+r-1)\}$.
Our strategy is to compute a certain dissimilarity measure between two consecutive segments $\mathcal{Y}(t)$
and $\mathcal{Y}(t+r)$, and use it as the plausibility of change points (see Figure 5). As a dissimilarity measure,
we use the $L^2$-distance estimated by LSDD and the Kullback-Leibler (KL) divergence estimated by
the KL importance estimation procedure (KLIEP) [2, 3]. We set $k = 10$ and $r = 50$.
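The construction of subsequences and consecutive segments can be sketched as follows. The helper names are hypothetical, and the dissimilarity passed in is a deliberately simple stand-in (squared distance between segment means), not LSDD or KLIEP; any function that maps two sets of subsequence samples to a scalar could be plugged in instead.

```python
import numpy as np

def subsequences(y, k):
    """Stack k consecutive m-dimensional samples into vectors Y(t) of length k*m."""
    y = np.asarray(y, dtype=float)
    if y.ndim == 1:
        y = y[:, None]                      # treat a 1-D series as m = 1
    n_time = y.shape[0]
    return np.stack([y[t:t + k].ravel() for t in range(n_time - k + 1)])

def change_scores(y, k, r, dissimilarity):
    """Score each time index by the dissimilarity between the two segments around it."""
    Y = subsequences(y, k)
    n = Y.shape[0]
    scores = np.full(n, np.nan)             # undefined near the boundaries
    for t in range(n - 2 * r + 1):
        past = Y[t:t + r]                   # the set of r subsequences starting at t
        future = Y[t + r:t + 2 * r]         # the set of r subsequences starting at t + r
        scores[t + r] = dissimilarity(past, future)
    return scores

def mean_shift(past, future):
    """Placeholder dissimilarity: squared distance between segment means."""
    return float(((past.mean(axis=0) - future.mean(axis=0)) ** 2).sum())

# Toy series with a single change at index 100.
y_toy = np.concatenate([np.zeros(100), np.ones(100)])
toy_scores = change_scores(y_toy, k=3, r=10, dissimilarity=mean_shift)
```

On the toy series the score peaks close to the true change point; with real data the placeholder would be replaced by a divergence estimate between the two segments.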

First, we use the IPSJ SIG-SLP Corpora and Environments for Noisy Speech Recognition (CENSREC)
dataset (http://research.nii.ac.jp/src/en/CENSREC-1-C.html). This
dataset, provided by the National Institute of Informatics, Japan, records human voice in a
noisy environment such as a restaurant. The top graphs in Figure 7(a) display the original
time-series (true change points were manually annotated) and change scores obtained by KLIEP and
LSDD. The graphs show that the LSDD-based change score indicates the existence of change points
more clearly than the KLIEP-based change score.

Next, we use a dataset taken from the Human Activity Sensing Consortium (HASC) challenge
2011 (http://hasc.jp/hc2011/), which provides human activity information collected by
portable three-axis accelerometers. Because the orientation of the accelerometers is not necessarily
fixed, we take the $\ell_2$-norm of the 3-dimensional data. The HASC dataset is relatively simple, so
we artificially added zero-mean Gaussian noise with standard deviation 5 at each time point with
probability 0.005. The top graphs in Figure 7(b) display the original time-series for a sequence of
actions "jog", "stay", "stair down", "stay", and "stair up" (there exist four change points at times 540,
1110, 1728, and 2286) and the change scores obtained by KLIEP and LSDD. The graphs show that
the LSDD score is much more stable and interpretable than the KLIEP score.

Finally, we compare the change-detection performance more systematically using the receiver
operating characteristic (ROC) curves (i.e., the false positive rate vs. the true positive rate) and the
area under the ROC curve (AUC) values. In addition to LSDD and KLIEP, we test the $L^2$-distance
estimated by KDEi and KDEj and native change detection methods based on autoregressive models
(AR) [19], subspace identification (SI) [20], singular spectrum transformation (SST) [21], one-class
support vector machine (SVM) [22], kernel Fisher discriminant analysis (KFD) [23], and kernel
change-point detection (KCP) [24]. Tuning parameters included in these methods were manually
optimized. For 10 datasets taken from each of the CENSREC and HASC data collections, mean ROC
curves and AUC values are displayed at the bottom of Figure 7(b). The results show that LSDD tends
to outperform other methods and is comparable to state-of-the-art native change-detection methods.

5  Conclusions 

In  this  paper,  we  proposed  a  method  for  directly  estimating  the  difference  between  two  probability 
density  functions  without  density  estimation.  The  proposed  method,  called  the  least-squares  density- 
difference  (LSDD),  was  derived  within  the  framework  of  kernel  least-squares  estimation,  and  its 
solution  can  be  computed  analytically  in  a  computationally  efficient  and  stable  manner.  Furthermore, 
LSDD  is  equipped  with  cross-validation,  and  thus  all  tuning  parameters  such  as  the  kernel  width  and 
the  regularization  parameter  can  be  systematically  and  objectively  optimized.  We  derived  a  finite- 
sample  error  bound  for  LSDD  in  a  non-parametric  setup,  and  showed  that  it  achieves  the  optimal 
convergence  rate.  We  also  proposed  an  L2-distance  estimator  based  on  LSDD,  which  nicely  cancels 
a  bias  caused  by  regularization.  Through  experiments  on  class-prior  estimation  and  change-point 
detection,  the  usefulness  of  the  proposed  LSDD  was  demonstrated. 
Figure 4: Class-balance estimation.

Figure 5: Change-point detection.

Figure 6: Results of semi-supervised class-balance estimation. Top: Squared error of class-balance
estimation. Bottom: Misclassification error by a weighted $\ell_2$-regularized least-squares classifier.

Figure 7: Results of unsupervised change detection. From top to bottom: Original time-series,
change scores obtained by KLIEP and LSDD, mean ROC curves over 10 datasets, and AUC values
for 10 datasets. The best method and comparable ones in terms of mean AUC values by the t-test at
the significance level 5% are indicated with boldface. "SE" stands for "Standard error".
Abstract

The goal of the two-sample test (a.k.a. the homogeneity test) is, given two sets of
samples, to judge whether the probability distributions behind the samples are the
same or not. In this paper, we propose a novel non-parametric method of two-sample
test based on a least-squares density ratio estimator. Through various experiments,
we show that the proposed method overall produces smaller type-II error (i.e., the
probability of judging the two distributions to be the same when they are actually
different) than a state-of-the-art method, with slightly larger type-I error (i.e., the
probability of judging the two distributions to be different when they are actually
the same).

Keywords

two-sample test, homogeneity test, density ratio estimation, unconstrained least-squares
importance fitting, Pearson divergence.
1  Introduction

Given two sets of samples, testing whether the probability distributions behind the samples
are equivalent or not is a fundamental task in statistical data analysis. This problem
is referred to as the two-sample test or the homogeneity test (Kullback, 1959).

1.1  Motivation  of  Two-Sample  Test 

The  two-sample  test  is  useful  in  various  practically  important  learning  scenarios.  Here  we 
describe  some  examples. 

When learning is performed in a non-stationary environment, e.g., in brain-computer
interfaces (Sugiyama et al., 2007) and robot control (Hachiya et al., 2009), testing the
homogeneity of data-generating distributions allows one to determine whether some adaptation
scheme should be used or not. When the distributions are not significantly different,
one can avoid using data-intensive non-stationarity adaptation techniques, which contributes
greatly to stabilizing the performance.

When multiple sets of data samples are available for learning, e.g., biological experimental
results obtained from different laboratories (Borgwardt et al., 2006), the homogeneity
test allows one to decide whether all the datasets can be analyzed jointly
as a single dataset or should be treated separately.

Similarly, one can use the homogeneity test for deciding whether multi-task learning
methods (Caruana et al., 1997) are employed or not. The rationale behind multi-task
learning is that when several related learning tasks are provided, solving them simultaneously
can give better solutions than solving them individually. However, when the
tasks are not similar to each other, using multi-task learning techniques can degrade the
performance. Thus, it is important to avoid using multi-task learning methods when the
tasks are not similar to each other. This may be achieved by testing the homogeneity of
datasets.

When several databases containing multiple fields are given, it is useful to identify the
correspondence between fields by comparing underlying distributions since this allows one
to merge databases (Gretton et al., 2007).

1.2  Methods  of  Two-Sample  Test 

The t-test (Student, 1908) is a classical method for testing homogeneity, which compares
the means of two Gaussian distributions with common variance. Its multi-variate extension
also exists (Hotelling, 1951). Although the t-test is a fundamental method for
comparing the means, its range of application is limited to Gaussian distributions, an
assumption that may not be fulfilled in practical applications.

The Kolmogorov-Smirnov test and the Wald-Wolfowitz runs test are classical non-parametric
methods for the two-sample problem; their multi-dimensional variants have
also been developed (Bickel, 1969; Friedman & Rafsky, 1979). Since then, different types
of non-parametric tests have been studied (Anderson et al., 1994; Li, 1996).
Recently, a non-parametric extension of the t-test called the maximum mean discrepancy
(MMD) was proposed (Borgwardt et al., 2006; Gretton et al., 2007). MMD compares
the means of two distributions in a universal reproducing kernel Hilbert space (universal
RKHS; Steinwart, 2001); the Gaussian kernel is a typical example that induces a universal
RKHS. MMD does not require a restrictive parametric assumption, so it could be
a flexible alternative to the t-test. MMD was experimentally shown to outperform other
homogeneity tests such as the generalized Kolmogorov-Smirnov test (Friedman & Rafsky,
1979), the generalized Wald-Wolfowitz test (Friedman & Rafsky, 1979), the Hall-Tajvidi
test (Hall & Tajvidi, 2002), and the Biau-Györfi test (Biau & Györfi, 2005).

The performance of MMD depends on the choice of universal RKHSs (e.g., the Gaussian
width in the case of Gaussian RKHSs). Thus, the universal RKHS should be carefully
chosen for obtaining the state-of-the-art performance. The Gaussian RKHS with width
set to the median distance between samples has been a popular heuristic in practice
(Borgwardt et al., 2006; Gretton et al., 2007). Recently, a novel idea of using the universal
RKHS (or the Gaussian widths) yielding the maximum MMD value has been introduced
(Sriperumbudur et al., 2009).
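To make the median heuristic concrete, the following sketch computes the Gaussian width as the median pairwise distance over the pooled samples and then evaluates the standard biased empirical estimate of the squared MMD. It is an illustration under assumptions: the Gaussian kernel is parameterized as $\exp(-\|x - x'\|^2/(2\sigma^2))$, the biased (V-statistic) form is used for brevity, and the function names are invented for the example.

```python
import numpy as np

def pairwise_sq_dists(A, B):
    """Pairwise squared Euclidean distances between rows of A and rows of B."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)

def median_heuristic_width(X, Xp):
    """Median of the pairwise distances among the pooled samples."""
    Z = np.vstack([X, Xp])
    sq = pairwise_sq_dists(Z, Z)
    upper = sq[np.triu_indices(len(Z), k=1)]     # distinct pairs only
    return float(np.median(np.sqrt(upper)))

def mmd2_biased(X, Xp, sigma):
    """Biased empirical squared MMD with a Gaussian kernel of width sigma."""
    def kernel(A, B):
        return np.exp(-pairwise_sq_dists(A, B) / (2.0 * sigma**2))
    return float(kernel(X, X).mean() + kernel(Xp, Xp).mean() - 2.0 * kernel(X, Xp).mean())
```

For example, `mmd2_biased(X, Xp, median_heuristic_width(X, Xp))` gives the statistic under the median heuristic; the value is zero when the two sample sets coincide.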

1.3  Divergence  Estimation 

Another approach to the two-sample problem is to evaluate a divergence between two
distributions. The divergence-based approach is advantageous in that cross-validation
over the divergence functional is available for optimizing tuning parameters in a
data-dependent manner. A typical choice of the divergence functional would be the
f-divergences (Ali & Silvey, 1966; Csiszár, 1967), which include the Kullback-Leibler
divergence (Kullback & Leibler, 1951) and the Pearson divergence (Pearson, 1900) as special
cases.

Various methods for estimating the divergence functional have been studied so far
(Darbellay & Vajda, 1999; Wang et al., 2005; Silva & Narayanan, 2007; Pérez-Cruz,
2008). Among them, approaches based on density ratio estimation have been shown to
be promising both theoretically and experimentally (Sugiyama et al., 2008; Gretton et al.,
2009; Kanamori et al., 2009a; Nguyen et al., 2010). So far, a parametric density ratio
estimator based on logistic regression (Qin, 1998; Cheng & Chu, 2004) has been applied
to the test of homogeneity (Keziou & Leoni-Aubin, 2005).

Although the density ratio estimator based on logistic regression was proved to achieve
the smallest asymptotic variance among a class of semi-parametric estimators (Qin, 1998),
this theoretical guarantee is valid only when the parametric model is correctly specified
(i.e., the target density ratio is included in the parametric model at hand). However,
when this unrealistic assumption is violated, a divergence-based density ratio estimator
(Sugiyama et al., 2008; Nguyen et al., 2010) was shown to perform better (Kanamori
et al., 2010).

Among various divergence-based density ratio estimators, a method called unconstrained
least-squares importance fitting (uLSIF) was demonstrated to be accurate and
computationally efficient (Kanamori et al., 2009a). Furthermore, uLSIF was proved to
possess the optimal non-parametric convergence rate and numerical stability (Kanamori
et al., 2009b). In this paper, we therefore develop a new method for testing homogeneity
based on uLSIF.

Similarly to MMD, our uLSIF-based homogeneity test processes data samples only
through kernel functions. Thus, the proposed method can be used for testing the homogeneity
of non-vectorial structured objects such as strings, trees, and graphs by employing
kernel functions defined for such structured data (Lodhi et al., 2002; Duffy & Collins, 2002;
Kashima & Koyanagi, 2002; Kondor & Lafferty, 2002; Kashima et al., 2003; Gärtner et al.,
2003; Gärtner, 2003). This is an advantage over traditional two-sample tests.

1.4  Organization  of  This  Paper 

The  rest  of  this  paper  is  structured  as  follows.  In  Section  2,  we  review  the  uLSIF  method 
for  density  ratio  estimation.  In  Section  3,  we  describe  a  method  of  divergence  estimation 
based  on  uLSIF,  and  investigate  its  theoretical  properties.  In  Section  4,  we  give  a  two- 
sample  test  based  on  the  permutation  test  (Efron  &  Tibshirani,  1993),  which  we  call 
least-squares  two-sample  test  (LSTT).  We  review  the  MMD  method  in  Section  5,  and 
compare  the  experimental  performance  of  LSTT  with  MMD  in  Section  6.  Finally,  we 
conclude  in  Section  7. 


2  Density  Ratio  Estimation 

In this section, we consider the problem of density ratio estimation, and review a method
called unconstrained least-squares importance fitting (uLSIF; Kanamori et al., 2009a),
which will be used in the following sections. Since this section is devoted to reviewing
uLSIF, readers who are familiar with it may skip directly to the next section.

2.1  Formulation  of  Density  Ratio  Estimation 

Suppose we are given a set of samples

$$X := \{x_i \mid x_i \in \mathbb{R}^d\}_{i=1}^{n}$$

drawn independently from a probability distribution $P$ with density $p(x)$, and another
set of samples

$$X' := \{x'_j \mid x'_j \in \mathbb{R}^d\}_{j=1}^{n'}$$

drawn independently from (possibly) another probability distribution $P'$ with density
$p'(x)$.
The goal of density ratio estimation is to estimate the density ratio function

$$r(x) := \frac{p(x)}{p'(x)} \qquad (1)$$

from the samples $X$ and $X'$, where we assume $p'(x) > 0$ for all $x$.


2.2  Least-Squares  Approach  to  Density  Ratio  Estimation 

Let us model the density ratio function $r(x)$ by the following kernel model¹:

$$\hat{r}(x) := \alpha_0 + \sum_{i=1}^{n} \alpha_i K(x, x_i) = \alpha^\top k(x),$$

where

$$\alpha := (\alpha_0, \alpha_1, \ldots, \alpha_n)^\top$$

are parameters to be learned from data samples, $^\top$ denotes the transpose of a matrix or
a vector, and

$$k(x) := (1, K(x, x_1), \ldots, K(x, x_n))^\top$$

are kernel basis functions. A popular choice of the kernel is the Gaussian function:

$$K(x, x') = \exp\Bigl(-\frac{\|x - x'\|^2}{2\sigma^2}\Bigr), \qquad (2)$$

where $\sigma^2$ denotes the Gaussian variance.

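We determine the parameter $\alpha$ in the model $\hat{r}(x)$ so that the following squared-error
$J_0$ is minimized:

$$J_0(\alpha) := \frac{1}{2}\int \bigl(\hat{r}(x) - r(x)\bigr)^2 p'(x)\,dx
= \frac{1}{2}\int \hat{r}(x)^2 p'(x)\,dx - \int \hat{r}(x)p(x)\,dx + \frac{1}{2}\int r(x)p(x)\,dx,$$

where the last term is a constant and therefore can be safely ignored. Let us denote the
first two terms by $J$:

$$J(\alpha) := \frac{1}{2}\int \hat{r}(x)^2 p'(x)\,dx - \int \hat{r}(x)p(x)\,dx
= \frac{1}{2}\alpha^\top H \alpha - h^\top \alpha, \qquad (3)$$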

¹ We included the constant basis function, 1, in our model, which is different from the original uLSIF
paper (Kanamori et al., 2009a). In the context of the two-sample test, we empirically found that including the
constant basis tends to improve the estimation accuracy since the density ratio function we approximate
can be close to constant (i.e., $r(x) \approx 1$) when the two distributions are similar.
where $H$ is the $(n + 1) \times (n + 1)$ matrix defined by

$$H := \int k(x)k(x)^\top p'(x)\,dx,$$

and $h$ is the $(n + 1)$-dimensional vector defined by

$$h := \int k(x)p(x)\,dx.$$

2.3  Empirical  Approximation 

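Since $J$ contains the expectation over unknown densities $p(x)$ and $p'(x)$, we approximate
the expectations by empirical averages. Then we obtain

$$\hat{J}(\alpha) := \frac{1}{2n'}\sum_{j=1}^{n'} \hat{r}(x'_j)^2 - \frac{1}{n}\sum_{i=1}^{n} \hat{r}(x_i)
= \frac{1}{2}\alpha^\top \hat{H} \alpha - \hat{h}^\top \alpha,$$

where $\hat{H}$ is the $(n + 1) \times (n + 1)$ matrix defined by

$$\hat{H} := \frac{1}{n'}\sum_{j=1}^{n'} k(x'_j)k(x'_j)^\top,$$

and $\hat{h}$ is the $(n + 1)$-dimensional vector defined by

$$\hat{h} := \frac{1}{n}\sum_{i=1}^{n} k(x_i). \qquad (4)$$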
By including a regularization term, the uLSIF optimization problem is formulated as
follows.

$$\hat{\alpha} := \mathop{\mathrm{argmin}}_{\alpha} \Bigl[ \frac{1}{2}\alpha^\top \hat{H} \alpha - \hat{h}^\top \alpha + \frac{\lambda}{2}\alpha^\top\alpha \Bigr], \qquad (5)$$

where $\alpha^\top\alpha/2$ is a regularizer and $\lambda$ ($\ge 0$) is the regularization parameter that controls
the strength of regularization. By taking the derivative of the above objective function
with respect to the parameter $\alpha$ and equating it to zero, we can analytically obtain the
solution $\hat{\alpha}$ as

$$\hat{\alpha} = (\hat{H} + \lambda I_{n+1})^{-1}\hat{h}, \qquad (6)$$

where $I_{n+1}$ is the $(n + 1)$-dimensional identity matrix. Finally, the density ratio estimator
$\hat{r}(x)$ is given by

$$\hat{r}(x) := \hat{\alpha}^\top k(x).$$
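The analytic solution can be written down directly in NumPy. The sketch below follows Eqs.(2), (4), and (6) with the constant basis function included; the function names are hypothetical, and `np.linalg.solve` is used instead of forming the matrix inverse explicitly, which is a numerical choice made here rather than something prescribed by the text.

```python
import numpy as np

def gaussian_design(X_query, centers, sigma):
    """Rows are k(x) = (1, K(x, c_1), ..., K(x, c_n)) with the Gaussian kernel (2)."""
    sq = ((X_query[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-sq / (2.0 * sigma**2))
    return np.hstack([np.ones((X_query.shape[0], 1)), K])

def ulsif_fit(X, Xp, sigma, lam):
    """Return (alpha, H_hat, h_hat) for samples X ~ p (numerator) and Xp ~ p' (denominator)."""
    Phi_p = gaussian_design(Xp, X, sigma)        # k(x'_j) for j = 1..n'
    Phi = gaussian_design(X, X, sigma)           # k(x_i)  for i = 1..n
    H_hat = Phi_p.T @ Phi_p / Xp.shape[0]        # (n+1) x (n+1) matrix
    h_hat = Phi.mean(axis=0)                     # (n+1)-dimensional vector, Eq.(4)
    alpha = np.linalg.solve(H_hat + lam * np.eye(H_hat.shape[0]), h_hat)   # Eq.(6)
    return alpha, H_hat, h_hat

def ulsif_predict(alpha, X_query, centers, sigma):
    """Evaluate the density ratio estimate alpha^T k(x) at the query points."""
    return gaussian_design(X_query, centers, sigma) @ alpha
```

Here the kernel centers are the numerator samples `X`, matching the model in which the basis functions are $K(x, x_i)$ for $i = 1, \ldots, n$.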
Thanks to the analytic-form expression, uLSIF is computationally more efficient than
alternative density ratio estimators which involve non-linear optimization (Qin, 1998;
Cheng & Chu, 2004; Huang et al., 2007; Sugiyama et al., 2008; Nguyen et al., 2010).
It was theoretically shown that uLSIF possesses the optimal non-parametric convergence
rate and optimal numerical stability (Kanamori et al., 2009b).


2.4  Model  Selection  by  Cross-Validation 

The  practical  performance  of  uLSIF  depends  on  the  choice  of  the  kernel  function  (the 
kernel  width  a  in  the  case  of  Gaussian  kernel  (2))  and  the  regularization  parameter  A. 
Model  selection  of  uLSIF  is  possible  based  on  cross-validation  with  respect  to  the  error 
criterion  J  defined  by  Eq.(3)  (Kanamori  et  ah,  2009a). 

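More specifically, each of the sample sets $X = \{x_i\}_{i=1}^{n}$ and $X' = \{x'_j\}_{j=1}^{n'}$ is divided
into M disjoint sets (see footnote 2) $\{X_m\}_{m=1}^{M}$ and $\{X'_m\}_{m=1}^{M}$. Then a uLSIF solution
$\widehat{r}_m(x)$ is obtained using $X \setminus X_m$ and $X' \setminus X'_m$ (i.e., all samples without $X_m$ and $X'_m$),
and its J-value for the hold-out samples $X_m$ and $X'_m$ is computed as

\[ \widehat{J}^{(m)}_{\mathrm{CV}} := \frac{1}{2|X'_m|} \sum_{x' \in X'_m} \widehat{r}_m(x')^2 - \frac{1}{|X_m|} \sum_{x \in X_m} \widehat{r}_m(x), \]

where |X| denotes the number of elements in the set X. This procedure is repeated for
m = 1, ..., M, and the average of $\widehat{J}^{(m)}_{\mathrm{CV}}$ over all m is computed as

\[ \widehat{J}_{\mathrm{CV}} := \frac{1}{M} \sum_{m=1}^{M} \widehat{J}^{(m)}_{\mathrm{CV}}. \]

Finally, the model (the kernel width $\sigma$ and the regularization parameter $\lambda$ in the current
setup) that minimizes $\widehat{J}_{\mathrm{CV}}$ is chosen as the most suitable one.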
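The cross-validation procedure above can be sketched as follows. This is a simplified illustration under stated assumptions: the names `cv_score` and `select_model` are hypothetical, the kernel centers are taken to be the training numerator samples of each fold, and the Gaussian kernel form exp(-||x - x'||^2 / (2 sigma^2)) is assumed.

```python
import numpy as np

def design(x, centers, sigma):
    """Rows are k(x) = (1, K(x, c_1), ..., K(x, c_n)) for an assumed Gaussian kernel."""
    sq = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.hstack([np.ones((len(x), 1)), np.exp(-sq / (2.0 * sigma ** 2))])

def cv_score(x_nu, x_de, sigma, lam, n_folds=5, seed=0):
    """Average hold-out J-value over the folds for one (sigma, lam) pair."""
    rng = np.random.default_rng(seed)
    folds_nu = np.array_split(rng.permutation(len(x_nu)), n_folds)
    folds_de = np.array_split(rng.permutation(len(x_de)), n_folds)
    scores = []
    for idx_nu, idx_de in zip(folds_nu, folds_de):
        tr_nu = np.delete(x_nu, idx_nu, axis=0)    # X  without X_m
        tr_de = np.delete(x_de, idx_de, axis=0)    # X' without X'_m
        k_de = design(tr_de, tr_nu, sigma)
        H = k_de.T @ k_de / len(tr_de)
        h = design(tr_nu, tr_nu, sigma).mean(axis=0)
        alpha = np.linalg.solve(H + lam * np.eye(len(h)), h)
        r_de = design(x_de[idx_de], tr_nu, sigma) @ alpha   # ratio on hold-out X'_m
        r_nu = design(x_nu[idx_nu], tr_nu, sigma) @ alpha   # ratio on hold-out X_m
        scores.append(0.5 * np.mean(r_de ** 2) - np.mean(r_nu))
    return float(np.mean(scores))

def select_model(x_nu, x_de, sigmas, lams):
    """Return the (sigma, lam) pair with the smallest cross-validation score."""
    grid = [(s, l) for s in sigmas for l in lams]
    return min(grid, key=lambda sl: cv_score(x_nu, x_de, sl[0], sl[1]))
```

Using the same seed for every candidate keeps the fold assignment fixed across the grid, so the scores are compared on identical splits.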


3  Divergence  Estimation 

In  this  section,  we  describe  a  divergence  estimator  based  on  uLSIF,  and  investigate  its 
theoretical  properties. 


Footnote 2: M = 5 seems to be a popular choice (Hastie et al., 2001). We also follow this
'rule-of-thumb' choice in this paper.
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3.1  Formulation  of  Divergence  Estimation 

Let us consider the Pearson divergence (Pearson, 1900) from P to P' as a discrepancy
measure between P and P', which is defined and expressed as follows:

\[ \mathrm{PE}(P, P') := \frac{1}{2} \int \left( \frac{p(x)}{p'(x)} - 1 \right)^2 p'(x)\,dx
   = \frac{1}{2} \int r(x) p(x)\,dx - \int r(x) p'(x)\,dx + \frac{1}{2}, \tag{7} \]

where r(x) is the density ratio function defined by

\[ r(x) := \frac{p(x)}{p'(x)}. \]

PE(P, P') vanishes if and only if P = P'. The Pearson divergence is a squared-loss
variant of the Kullback-Leibler divergence (Kullback & Leibler, 1951), and is an instance
of the f-divergences, which are also known as the Csiszár f-divergences (Csiszár, 1967)
or the Ali-Silvey distances (Ali & Silvey, 1966).
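For reference, the second expression of the Pearson divergence follows from expanding the square and using r(x) p'(x) = p(x) together with the normalization of p'. The short derivation below is a worked restatement under the definitions of this section, not an additional result.

```latex
\begin{align*}
\mathrm{PE}(P, P')
  &= \frac{1}{2} \int \bigl( r(x) - 1 \bigr)^2 p'(x)\,dx \\
  &= \frac{1}{2} \int r(x)^2 p'(x)\,dx - \int r(x) p'(x)\,dx + \frac{1}{2} \int p'(x)\,dx \\
  &= \frac{1}{2} \int r(x) p(x)\,dx - \int r(x) p'(x)\,dx + \frac{1}{2}.
\end{align*}
```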


3.2  uLSIF-based  Pearson  Divergence  Estimation 

Approximating the expectations in Eq.(7) by empirical averages and replacing the density
ratio function r(x) by a uLSIF-based estimator $\widehat{r}(x)$, we have the following Pearson
divergence estimator:

\[ \widehat{\mathrm{PE}}(X, X') := \frac{1}{2n} \sum_{i=1}^{n} \widehat{r}(x_i) - \frac{1}{n'} \sum_{j=1}^{n'} \widehat{r}(x'_j) + \frac{1}{2}
   = \frac{1}{2} \widehat{h}^\top \widehat{\alpha} - \widehat{h}'^\top \widehat{\alpha} + \frac{1}{2}, \tag{8} \]

where $\widehat{\alpha}$ is given by Eq.(6), $\widehat{h}$ is defined by Eq.(4), and $\widehat{h}'$ is the (n + 1)-dimensional
vector defined by

\[ \widehat{h}' := \frac{1}{n'} \sum_{j=1}^{n'} k(x'_j). \]

Note that $\widehat{\mathrm{PE}}(X, X')$ can take a negative value, although the true PE(P, P') is
non-negative by definition. Thus, the estimation accuracy of $\widehat{\mathrm{PE}}(X, X')$ can be improved by
taking its positive part, i.e., by rounding up a negative estimate to zero. However, we do not
employ this rounding-up strategy here since we are interested in the relative ranking of
the divergence estimates, as explained in Section 4.1.
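A minimal sketch of the estimator in Eq.(8) is given below. It assumes a fixed kernel width and regularization parameter (the paper selects them by cross-validation), kernel centers at the numerator samples, and the Gaussian kernel form exp(-||x - x'||^2 / (2 sigma^2)); the function names are hypothetical.

```python
import numpy as np

def design(x, centers, sigma):
    """Rows are k(x) = (1, K(x, c_1), ..., K(x, c_n)) for an assumed Gaussian kernel."""
    sq = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.hstack([np.ones((len(x), 1)), np.exp(-sq / (2.0 * sigma ** 2))])

def pearson_divergence(x_nu, x_de, sigma, lam):
    """uLSIF-based estimate: 0.5 * h^T alpha - h'^T alpha + 0.5."""
    k_nu = design(x_nu, x_nu, sigma)
    k_de = design(x_de, x_nu, sigma)
    H = k_de.T @ k_de / len(x_de)
    h = k_nu.mean(axis=0)        # average of k(x_i) over numerator samples
    h_de = k_de.mean(axis=0)     # average of k(x'_j) over denominator samples
    alpha = np.linalg.solve(H + lam * np.eye(len(h)), h)
    return float(0.5 * h @ alpha - h_de @ alpha + 0.5)
```

No clipping at zero is applied, in line with the remark above that only the relative ranking of the estimates matters for the test.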
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3.3  Theoretical  Properties 


Here, let us theoretically investigate asymptotic properties of the uLSIF-based Pearson
divergence estimator $\widehat{\mathrm{PE}}(X, X')$. More specifically, we show the asymptotic convergence
rate of our non-parametric estimator $\widehat{\mathrm{PE}}(X, X')$ to the true PE(P, P').

Since the derivation of the convergence rate is highly technical, we defer all the
technical details to Appendix A. Here, we focus on explaining the insight we can gain from
our theoretical analysis.


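Theorem 1. Under the technical assumptions described in Appendix A, we have

\[ \bigl| \widehat{\mathrm{PE}}(X, X') - \mathrm{PE}(P, P') \bigr|
   = O_p\!\left( \left( \frac{\log \bar{n}}{\bar{n}} \right)^{\frac{2}{2+\gamma}} + C \left( \frac{\log \bar{n}}{\bar{n}} \right)^{\frac{1}{2+\gamma}} \right), \tag{9} \]

where

\[ C := \sqrt{ \int \bigl( r(x) - 1 \bigr)^2 p'(x)\,dx }, \tag{10} \]

$O_p$ denotes the asymptotic order in probability, $\bar{n} := \min(n, n')$, and $\gamma$ (0 < $\gamma$ < 1) is a
constant determined by the kernel function K(·, ·).

The above theorem means that the convergence rate of $\widehat{\mathrm{PE}}(X, X')$ to PE(P, P') is
$(\log \bar{n} / \bar{n})^{1/(2+\gamma)}$ in general. However, when the two distributions P and P' are the same, r(x) = 1
and thus C = 0 (see Eq.(10)). Then, the $O_p\bigl((\log \bar{n} / \bar{n})^{1/(2+\gamma)}\bigr)$-term in Eq.(9) disappears, and
therefore our estimator possesses an even faster convergence rate $O_p\bigl((\log \bar{n} / \bar{n})^{2/(2+\gamma)}\bigr)$.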


4  Least-Squares  Two-Sample  Test 

Theoretical properties of our Pearson divergence estimator $\widehat{\mathrm{PE}}(X, X')$ have been
elucidated above. In this section, we propose a two-sample test based on $\widehat{\mathrm{PE}}(X, X')$. We
first describe a basic procedure of our two-sample test, and study its theoretical
properties. Then we illustrate its behavior using toy datasets, and discuss practical issues for
improving the performance.


4.1  Permutation  Test  with  Finite  Samples 

Our  two-sample  test  is  based  on  the  permutation  test  (Efron  &  Tibshirani,  1993). 

We first run the uLSIF-based Pearson divergence estimation procedure using the
original datasets X and X', and obtain a Pearson divergence estimate $\widehat{\mathrm{PE}}(X, X')$. Next, we
randomly permute the |X ∪ X'| samples, and assign the first |X| samples to a set $\widetilde{X}$ and
the remaining |X'| samples to another set $\widetilde{X}'$. Then we run the uLSIF-based Pearson
divergence estimation procedure again using the randomly shuffled datasets $\widetilde{X}$ and $\widetilde{X}'$,
and obtain a Pearson divergence estimate $\widehat{\mathrm{PE}}(\widetilde{X}, \widetilde{X}')$. Since $\widetilde{X}$ and $\widetilde{X}'$ can be regarded as
being drawn from the same distribution, $\widehat{\mathrm{PE}}(\widetilde{X}, \widetilde{X}')$ would take a value close to zero. This
random shuffling procedure is repeated many times, and the distribution of $\widehat{\mathrm{PE}}(\widetilde{X}, \widetilde{X}')$
under the null hypothesis (i.e., the two distributions are the same) is constructed.
Finally, the p-value is approximated by evaluating the relative ranking of $\widehat{\mathrm{PE}}(X, X')$ in the
distribution of $\widehat{\mathrm{PE}}(\widetilde{X}, \widetilde{X}')$.

[Figure 1: The role of the variables $\alpha$ and $\beta$ in Theorem 2.]

We refer to this procedure as the least-squares two-sample test (LSTT).
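The permutation procedure can be sketched in Python as follows. This is an illustrative sketch only: the kernel width and regularization parameter are held fixed for brevity (the paper selects them by cross-validation within each estimation run), the Gaussian kernel form is assumed, and the names `pe_hat` and `lstt_pvalue` are hypothetical.

```python
import numpy as np

def design(x, centers, sigma):
    """Rows are k(x) = (1, K(x, c_1), ..., K(x, c_n)) for an assumed Gaussian kernel."""
    sq = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.hstack([np.ones((len(x), 1)), np.exp(-sq / (2.0 * sigma ** 2))])

def pe_hat(x, y, sigma, lam):
    """uLSIF-based Pearson divergence estimate with fixed sigma and lam."""
    k_x, k_y = design(x, x, sigma), design(y, x, sigma)
    H = k_y.T @ k_y / len(y)
    h, h_de = k_x.mean(axis=0), k_y.mean(axis=0)
    alpha = np.linalg.solve(H + lam * np.eye(len(h)), h)
    return float(0.5 * h @ alpha - h_de @ alpha + 0.5)

def lstt_pvalue(x, y, n_perm=100, sigma=1.0, lam=0.1, seed=0):
    """Fraction of shuffled estimates that exceed the estimate on the original data."""
    rng = np.random.default_rng(seed)
    pe0 = pe_hat(x, y, sigma, lam)
    pooled = np.concatenate([x, y], axis=0)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        xs, ys = pooled[idx[:len(x)]], pooled[idx[len(x):]]   # shuffled sets
        exceed += int(pe_hat(xs, ys, sigma, lam) > pe0)
    return exceed / n_perm
```

Each permutation is independent of the others, so the loop is straightforward to distribute over several cores.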

4.2  Theoretical  Properties 

Here,  we  investigate  theoretical  properties  of  the  above  permutation  procedure  under  the 
null-hypothesis  P  =  P' . 

Theorem 2. Suppose |X| = |X'|, and let F be the distribution function of $\widehat{\mathrm{PE}}(\widetilde{X}, \widetilde{X}')$.
Let

\[ \beta := \sup\{ t \in \mathbb{R} \mid F(t) \le 1 - \alpha \} \]

be the upper 100$\alpha$-percentile point of F (see Figure 1). If P = P', we have

\[ \mathrm{Prob}\bigl( \widehat{\mathrm{PE}}(X, X') > \beta \bigr) \le \alpha, \]

where 'Prob(e)' denotes the probability of an event e.

A proof of Theorem 2 is provided in Appendix B.

Theorem 2 means that, for a given significance level $\alpha$ (see footnote 3), the probability that
$\widehat{\mathrm{PE}}(X, X')$ exceeds $\beta$ is at most $\alpha$ when P = P'. Thus, when the null hypothesis is correct, it is
accepted with probability at least 1 - $\alpha$.


Footnote 3: Conventionally, $\alpha$ = 0.01 or 0.05 is used.
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4.3  Numerical  Examples 

Let the number of samples be n = n' = 500, and

\[ X = \{x_i\}_{i=1}^{n} \overset{\mathrm{i.i.d.}}{\sim} P = N(0, 1), \]

\[ X' = \{x'_j\}_{j=1}^{n'} \overset{\mathrm{i.i.d.}}{\sim} P' = N(\mu, \sigma^2), \]

where $N(\mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$. We consider
the following four setups:

(a) ($\mu$, $\sigma$) = (0, 1.3): P' has larger standard deviation than P,

(b) ($\mu$, $\sigma$) = (0, 0.7): P' has smaller standard deviation than P,

(c) ($\mu$, $\sigma$) = (0.3, 1): P and P' have different means,

(d) ($\mu$, $\sigma$) = (0, 1): P and P' are the same.

Histograms of $X = \{x_i\}_{i=1}^{n}$ and $X' = \{x'_j\}_{j=1}^{n'}$ for the above four cases are depicted in
Figure 2. Examples of randomly shuffled samples $\widetilde{X}$ are also plotted at the bottom, where
$\widetilde{X}$ is thought to follow $\frac{1}{2} N(0, 1) + \frac{1}{2} N(\mu, \sigma^2)$. Since $\widetilde{X}'$ has a similar histogram to $\widetilde{X}$, its
plot is omitted.

Figure 3 depicts histograms of $\widehat{\mathrm{PE}}(\widetilde{X}, \widetilde{X}')$ (i.e., shuffled datasets), showing that the
profiles of the null distribution (i.e., the two distributions are the same) are rather similar
to each other for the four cases. The values of $\widehat{\mathrm{PE}}(X, X')$ (i.e., the original datasets) are
also plotted in Figure 3 using the 'x'-symbol on the horizontal axis, showing that the
p-values tend to be small when P ≠ P' and the p-value is large when P = P'. This is
desirable behavior for a hypothesis test.

Figure 4 depicts the mean and standard deviation of p-values over 100 runs as functions
of the sample size n (= n'), indicated by 'plain'. The graphs show that, when P ≠ P', the
p-values tend to decrease as n increases. On the other hand, when P = P', the p-values
are almost unchanged and remain relatively large.

Figure 5 depicts the rate of accepting the null hypothesis (i.e., P = P') over 100 runs
when the significance level is set to 0.05 (i.e., the rate of p-values larger than 0.05). The
graphs show that, when P ≠ P', the null hypothesis tends to be rejected more frequently
as n increases. On the other hand, when P = P', the null hypothesis is almost always
accepted. Thus, the proposed test was shown to work properly for these toy datasets.

4.4  Choice  of  Numerator/Denominator  Densities 

[Figure 2: Histograms of original samples $X \sim N(0, 1)$ and $X' \sim N(\mu, \sigma^2)$, and the shuffled
samples (which are thought to follow $\widetilde{X} \sim \frac{1}{2} N(0, 1) + \frac{1}{2} N(\mu, \sigma^2)$) for the toy datasets.
Panels: (a) ($\mu$, $\sigma$) = (0, 1.3), (b) ($\mu$, $\sigma$) = (0, 0.7), (c) ($\mu$, $\sigma$) = (0.3, 1), (d) ($\mu$, $\sigma$) = (0, 1).]

[Figure 3: Histograms of $\widehat{\mathrm{PE}}(\widetilde{X}, \widetilde{X}')$ (i.e., shuffled datasets) for the toy datasets. 'x'
indicates the value of $\widehat{\mathrm{PE}}(X, X')$ (i.e., the original datasets). Horizontal axis: PE.
Panels: (a) ($\mu$, $\sigma$) = (0, 1.3), (b) ($\mu$, $\sigma$) = (0, 0.7), (c) ($\mu$, $\sigma$) = (0.3, 1), (d) ($\mu$, $\sigma$) = (0, 1).]

[Figure 4: Mean and standard deviation of p-values for the toy datasets. Horizontal axis: n.
Panels: (a) ($\mu$, $\sigma$) = (0, 1.3), (b) ($\mu$, $\sigma$) = (0, 0.7), (c) ($\mu$, $\sigma$) = (0.3, 1), (d) ($\mu$, $\sigma$) = (0, 1).]

[Figure 5: The rate of accepting the null hypothesis (i.e., P = P') for the toy datasets
under the significance level 0.05.
Panels: (a) ($\mu$, $\sigma$) = (0, 1.3), (b) ($\mu$, $\sigma$) = (0, 0.7), (c) ($\mu$, $\sigma$) = (0.3, 1), (d) ($\mu$, $\sigma$) = (0, 1).]

In our test procedure, we use uLSIF for estimating the density ratio function r(x):

\[ r(x) = \frac{p(x)}{p'(x)}. \]

By definition, the reciprocal of the density ratio r(x),

\[ \frac{1}{r(x)} = \frac{p'(x)}{p(x)}, \]

is also a density ratio function, assuming that p(x) > 0 for all x. This means that we can
use uLSIF in two ways, either estimating the original density ratio r(x) or its reciprocal
1/r(x).

To illustrate this difference, we also carried out the same experiments as in Section 4.3
by swapping X and X'. The obtained p-values and the acceptance rate are also plotted
in Figure 4 and Figure 5 as 'reciprocal'. In the experiments, we prefer to have smaller
p-values when P ≠ P' and larger p-values when P = P'. The graphs show that, when
($\mu$, $\sigma$) = (0, 1.3), estimating the inverted density ratio gives slightly smaller p-values
and a significantly lower acceptance rate. On the other hand, when ($\mu$, $\sigma$) = (0, 0.7),
reciprocal estimation yields larger p-values and a significantly higher acceptance rate.
When ($\mu$, $\sigma$) = (0.3, 1) and ($\mu$, $\sigma$) = (0, 1), the 'plain' and 'reciprocal' methods result
in similar p-values and thus similar acceptance rates. These experimental results imply
that, if we adaptively choose between the plain and reciprocal approaches, the performance of
the homogeneity test may be improved.

Figure 4 showed that, when P = P' (i.e., ($\mu$, $\sigma$) = (0, 1)), the p-values are large enough
to accept the null hypothesis for both the plain and reciprocal approaches. Thus, the
type-I error (the probability of rejecting correct null hypotheses, i.e., two distributions are
judged to be different when they are actually the same) would be sufficiently small for
both approaches, as illustrated in Figure 5. Based on this observation, we propose to
choose the smaller of the p-values given by the plain and reciprocal approaches. This allows us to
reduce the type-II error (the probability of accepting incorrect null hypotheses, i.e., two
distributions are judged to be the same when they are actually different), and thus the
power of the test can be enhanced.

The experimental results of this adaptive method are also included in Figure 4 and
Figure 5 as 'adaptive'. The results show that p-values obtained by the adaptive method are
smaller than those obtained by the plain and reciprocal approaches, providing significant
performance improvement when P ≠ P'. On the other hand, smaller p-values can be
problematic when P = P' since the acceptance rate can be lowered. However, as the
experimental results show, the p-values are still large enough to accept the null hypothesis
and thus there is no critical performance degradation in this illustrative example.

Pseudo code of the 'adaptive' LSTT method is summarized in Figure 6 and
Figure 7. Although the permutation test process is computationally intensive, it can be easily
parallelized using multi-processors/cores.

A  MATLAB®  implementation  of  LSTT  is  available  from 

http://sugiyama-www.cs.titech.ac.jp/~sugi/software/LSTT/ 
    Input: Two sets of samples $X = \{x_i\}_{i=1}^{n}$ and $X' = \{x'_j\}_{j=1}^{n'}$
    Output: p-value p

    p_0  <-  \widehat{PE}(X, X');
    p'_0 <-  \widehat{PE}(X', X);
    For t = 1, ..., T
        Randomly split X ∪ X' into \widetilde{X} of size |X| and \widetilde{X}' of size |X'|;
        p_t  <-  \widehat{PE}(\widetilde{X}, \widetilde{X}');
        p'_t <-  \widehat{PE}(\widetilde{X}', \widetilde{X});
    End
    p  <-  (1/T) \sum_{t=1}^{T} I(p_t > p_0);
    p' <-  (1/T) \sum_{t=1}^{T} I(p'_t > p'_0);
    p  <-  min(p, p');

Figure 6: Pseudo code of LSTT. Pseudo code of $\widehat{\mathrm{PE}}(X, X')$ is given in Figure 7. I(c)
denotes the indicator function, i.e., I(c) = 1 if the condition c is true; otherwise I(c) = 0.
When |X| = |X'| (i.e., n = n'), p'_t <- $\widehat{\mathrm{PE}}(\widetilde{X}', \widetilde{X})$ may be replaced by p'_t <- p_t since
switching $\widetilde{X}$ and $\widetilde{X}'$ does not essentially affect the estimation of the Pearson divergence.
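The structure of Figure 6 can be sketched in Python for an arbitrary two-sample statistic. The function name `adaptive_pvalue` and the callable `stat` are hypothetical placeholders; in LSTT, `stat` would be the uLSIF-based Pearson divergence estimator of Figure 7.

```python
import numpy as np

def adaptive_pvalue(stat, x, y, n_perm=100, seed=0):
    """Smaller of the 'plain' and 'reciprocal' permutation p-values.

    stat(a, b) is any two-sample statistic that is large when a and b differ;
    it is evaluated in both argument orders.
    """
    rng = np.random.default_rng(seed)
    p0, q0 = stat(x, y), stat(y, x)            # plain and reciprocal statistics
    pooled = np.concatenate([x, y], axis=0)
    count_p = count_q = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        xs, ys = pooled[idx[:len(x)]], pooled[idx[len(x):]]
        count_p += int(stat(xs, ys) > p0)
        count_q += int(stat(ys, xs) > q0)
    return min(count_p, count_q) / n_perm
```

As a usage example with a stand-in statistic (not the one used in the paper), `adaptive_pvalue(lambda a, b: np.var(a) / np.var(b), x, y)` evaluates a variance-ratio statistic in both directions and reports the smaller p-value.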


5  Maximum  Mean  Discrepancy 

Maximum mean discrepancy (MMD; Borgwardt et al., 2006; Gretton et al., 2007) is a
state-of-the-art method for homogeneity testing. In this section, we review the definition
of MMD and explain its basic properties. In the next section, the proposed LSTT is
experimentally compared with MMD.

MMD is an integral probability metric (Müller, 1997) defined as

\[ \mathrm{MMD}(\mathcal{H}, P, P') := \sup_{f \in \mathcal{H}} \left( \int f(x) p(x)\,dx - \int f(x) p'(x)\,dx \right), \tag{11} \]

where $\mathcal{H}$ is some class of real-valued functions. When $\mathcal{H}$ is a unit ball in a universal reproducing
kernel Hilbert space (universal RKHS; Steinwart, 2001) defined on a compact metric space,
then $\mathrm{MMD}(\mathcal{H}, P, P')$ vanishes if and only if P = P'. Gaussian RKHSs are examples of
the universal RKHS.

Let K(x, x') be a reproducing kernel function. Then the reproducing property
(Aronszajn, 1950) allows one to extract the value of a function $f \in \mathcal{H}$ at a point x as

\[ f(x) = \langle f(\cdot), K(x, \cdot) \rangle_{\mathcal{H}}, \tag{12} \]

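where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ denotes the inner product in the RKHS $\mathcal{H}$.

    Input: Two sets of samples $X = \{x_i\}_{i=1}^{n}$ and $X' = \{x'_j\}_{j=1}^{n'}$
    Output: Pearson divergence estimate \widehat{PE}(X, X')

    Randomly split X into \{X_m\}_{m=1}^{M} and X' into \{X'_m\}_{m=1}^{M};
    For each candidate of Gaussian width \sigma
        For m = 1, ..., M
            % k_\sigma(x) = (1, K_\sigma(x, x_1), ..., K_\sigma(x, x_n))^T
            % K_\sigma(x, x') = exp(-||x - x'||^2 / (2 \sigma^2))
            G_m <- \sum_{x' \in X'_m} k_\sigma(x') k_\sigma(x')^T;
            g_m <- \sum_{x \in X_m} k_\sigma(x);
        End
        For each candidate of regularization parameter \lambda
            For m = 1, ..., M
                ...

Figure 7: Pseudo code of uLSIF-based Pearson divergence estimator.

Let $\|\cdot\|_{\mathcal{H}}$ be the norm in the RKHS $\mathcal{H}$. Then, one can explicitly express MMD in terms
of the kernel function as

\[ \mathrm{MMD}(\mathcal{H}, P, P')
   = \sup_{\|f\|_{\mathcal{H}} \le 1} \left( \int \langle f(\cdot), K(x, \cdot) \rangle_{\mathcal{H}}\, p(x)\,dx - \int \langle f(\cdot), K(x, \cdot) \rangle_{\mathcal{H}}\, p'(x)\,dx \right) \]
\[ = \sup_{\|f\|_{\mathcal{H}} \le 1} \left\langle f(\cdot), \int K(x, \cdot) p(x)\,dx - \int K(x, \cdot) p'(x)\,dx \right\rangle_{\mathcal{H}} \]
\[ = \left\| \int K(x, \cdot) p(x)\,dx - \int K(x, \cdot) p'(x)\,dx \right\|_{\mathcal{H}}, \]

where the Cauchy-Schwarz inequality (Bachman & Narici, 2000) was used in the last
equality. Furthermore, by using

\[ K(x, x') = \langle K(x, \cdot), K(x', \cdot) \rangle_{\mathcal{H}}, \]

the squared MMD can be expressed as

\[ \mathrm{MMD}^2(\mathcal{H}, P, P')
   = \left\| \int K(x, \cdot) p(x)\,dx - \int K(x, \cdot) p'(x)\,dx \right\|_{\mathcal{H}}^2 \]
\[ = \iint K(x, x') p(x) p(x')\,dx\,dx' + \iint K(x, x') p'(x) p'(x')\,dx\,dx'
   - 2 \iint K(x, x') p(x) p'(x')\,dx\,dx'. \]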

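The above expression allows one to immediately obtain an empirical estimator: with
the i.i.d. samples $X = \{x_i\}_{i=1}^{n}$ following p(x) and $X' = \{x'_j\}_{j=1}^{n'}$ following p'(x), a
consistent estimator of $\mathrm{MMD}^2(\mathcal{H}, P, P')$ is given as

\[ \widehat{\mathrm{MMD}}^2(\mathcal{H}, X, X') := \frac{1}{n^2} \sum_{i, i'=1}^{n} K(x_i, x_{i'})
   + \frac{1}{n'^2} \sum_{j, j'=1}^{n'} K(x'_j, x'_{j'})
   - \frac{2}{n n'} \sum_{i=1}^{n} \sum_{j=1}^{n'} K(x_i, x'_j). \]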
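The empirical estimator can be sketched with NumPy as follows. The sketch assumes a Gaussian kernel of the form exp(-||x - x'||^2 / (2 sigma^2)); the function names are hypothetical.

```python
import numpy as np

def gaussian_gram(a, b, sigma):
    """Gram matrix with entries K(a_i, b_j) for an assumed Gaussian kernel."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def mmd2(x, y, sigma):
    """Empirical squared MMD: mean K(x, x) + mean K(y, y) - 2 * mean K(x, y)."""
    return float(gaussian_gram(x, x, sigma).mean()
                 + gaussian_gram(y, y, sigma).mean()
                 - 2.0 * gaussian_gram(x, y, sigma).mean())
```

Because the diagonal terms are included in the first two averages, this is the plug-in (biased) form of the estimator; it is non-negative up to rounding error.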


By the same permutation test procedure as the one described in Section 4.1, one
can compute p-values for $\widehat{\mathrm{MMD}}^2(\mathcal{H}, X, X')$. Furthermore, an asymptotic distribution
of $\widehat{\mathrm{MMD}}^2(\mathcal{H}, X, X')$ under P = P' can be explicitly obtained (Borgwardt et al., 2006;
Gretton et al., 2007). This allows one to compute the p-values without resorting to the
computationally intensive permutation procedure, which is an advantage of MMD over
LSTT.

$\widehat{\mathrm{MMD}}^2(\mathcal{H}, X, X')$ depends on the choice of the universal RKHS $\mathcal{H}$. In the original
MMD papers (Borgwardt et al., 2006; Gretton et al., 2007), the Gaussian RKHS with
width set to the median distance between samples was used, which is a popular heuristic in
the kernel method community (Schölkopf & Smola, 2002). Recently, the idea of using the
universal RKHS yielding the maximum MMD value has been introduced (Sriperumbudur
et al., 2009). In the experiments in the next section, we use this maximum-MMD technique
for choosing the universal RKHS, which we confirmed to work better than the 'median'
heuristic.
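The two width-selection strategies can be sketched as follows. This is an illustration under assumptions: the median is taken over pairwise distances in the pooled sample, the maximum-MMD choice is restricted to a finite candidate list, and the function names are hypothetical.

```python
import numpy as np

def sq_dists(a, b):
    """Matrix of squared Euclidean distances between rows of a and rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

def median_heuristic(x, y):
    """Median pairwise distance in the pooled sample (the 'median' heuristic)."""
    z = np.concatenate([x, y], axis=0)
    d = np.sqrt(sq_dists(z, z)[np.triu_indices(len(z), k=1)])
    return float(np.median(d))

def mmd2(x, y, sigma):
    """Empirical squared MMD with an assumed Gaussian kernel of width sigma."""
    gram = lambda a, b: np.exp(-sq_dists(a, b) / (2.0 * sigma ** 2))
    return float(gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean())

def max_mmd_sigma(x, y, candidates):
    """Candidate width that yields the largest empirical squared MMD."""
    return max(candidates, key=lambda s: mmd2(x, y, s))
```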


6  Experiments 

In  this  section,  we  report  experimental  results  comparing  the  performance  of  the  proposed 
LSTT  (Section  4)  with  that  of  the  state-of-the-art  MMD  (Section  5). 

6.1  IDA  Benchmark  Datasets 

In  the  first  set  of  experiments,  we  used  binary  classification  datasets  taken  from  the  IDA 
repository  (Ratsch  et  al.,  2001).  For  each  dataset,  we  randomly  split  all  the  positive 
training  samples  into  two  disjoint  sets,  X  and  X'  with  \X\  =  \X'\. 

We  first  investigated  whether  the  tests  can  correctly  accept  the  null  hypotheses  (i.e. , 
X  and  X'  follow  the  same  distribution).  We  used  the  Gaussian  kernel  both  for  LSTT  and 
MMD.  The  Gaussian  width  and  the  regularization  parameter  in  LSTT  were  determined 
by  5-fold  cross-validation  (see  Section  2.4).  The  Gaussian  width  in  MMD  was  chosen  so 
that  the  MMD  value  is  maximized  (see  Section  5).  Since  the  permutation  test  procedures 
in  LSTT  and  MMD  are  exactly  the  same,  we  are  purely  comparing  the  performance  of 
the  MMD  and  LSTT  criteria  in  this  experiment. 

We investigated the rate of accepting the null hypothesis as a function of the relative
sample size $\eta$ for the significance level 0.05. The relative sample size $\eta$ means that we used
samples of size $\eta |X|$ and $\eta |X'|$ for the homogeneity test. The experimental results are plotted
in Figure 8 by lines with 'o'-symbols. The results show that both methods almost always
accepted the null hypothesis correctly, meaning that the type-I error is small enough for
both MMD and LSTT. However, MMD seems to perform slightly better than LSTT in
terms of the type-I error.

Next,  we  replaced  a  fraction  of  samples  in  the  set  X '  by  randomly  chosen  negative 
training  samples.  Thus,  while  X  contains  only  positive  training  samples,  X'  includes 
both  positive  and  negative  training  samples.  The  experimental  results  are  also  plotted 
in  Figure  8  by  lines  with  ‘x’-symbols.  The  results  show  that  LSTT  tended  to  correctly 
reject  the  null  hypothesis  more  frequently  than  MMD  for  the  ‘banana’,  ‘ringnorm’,  ‘splice’, 
‘twonorm’,  and  ‘waveform’  datasets.  MMD  worked  better  than  LSTT  for  the  ‘thyroid’ 
dataset,  and  the  two  methods  were  comparable  to  each  other  for  the  other  datasets. 
Overall,  LSTT  compares  favorably  with  MMD  in  terms  of  the  type-II  error. 

6.2  USPS  Hand- Written  Digit  Dataset 

In the second set of experiments, we used the USPS hand-written digit dataset provided
by the U.S. Postal Service (Hastie et al., 2001). Each digit image (representing an integer in
{0, 1, 2, ..., 9}) consists of 256 (= 16 x 16) pixels, each of which takes a value between -1
and +1 representing its intensity level in gray-scale.

[Figure 8: The rate of accepting the null hypothesis (i.e., P = P') for IDA datasets under
the significance level 0.05. $\eta$ indicates the relative sample size we used in the experiments.
Recoverable panels: (g) Image, (h) Ringnorm, (i) Splice, (j) Thyroid, (k) Twonorm, (l) Waveform.]

We formed two sets of samples as follows: one consists of 500 samples randomly chosen
from class c (∈ {0, 1, 2, ..., 9}), while the other consists of 500(1 - $\delta$) samples randomly
chosen from class c and 500$\delta$ samples randomly chosen from another class c' (≠ c), where
$\delta$ is the contamination rate. The goal is to test whether the two sets of samples are drawn
from the same distribution or not for various contamination rates.
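The construction of the two sample sets can be sketched as follows. The function name is hypothetical, and the sketch assumes that the two sets are drawn independently from the class-c pool (so they may share samples), which the text above does not specify.

```python
import numpy as np

def contaminated_pair(target, other, size, delta, seed=0):
    """Return (clean, mixed): `clean` has `size` rows drawn from `target`;
    `mixed` has size * (1 - delta) rows from `target` and size * delta rows from `other`."""
    rng = np.random.default_rng(seed)
    n_other = int(round(size * delta))
    clean = target[rng.choice(len(target), size=size, replace=False)]
    part_t = target[rng.choice(len(target), size=size - n_other, replace=False)]
    part_o = other[rng.choice(len(other), size=n_other, replace=False)]
    return clean, np.concatenate([part_t, part_o], axis=0)
```

With `delta = 0` the second set contains only class-c samples, which corresponds to the case where the null hypothesis is correct.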

Table 1 shows the number of times LSTT or MMD incorrectly rejected the null
hypothesis over 10 runs when the null hypothesis is correct (i.e., $\delta$ = 0, meaning that the two
distributions are the same). Thus, the smaller the number is, the better the performance
is. The significance level was set to 0.05. The format 'l/m' in the table means that LSTT
and MMD rejected the null hypothesis l and m times, respectively. The results show that
both LSTT and MMD almost always accepted the correct null hypothesis successfully.

Next, we compared the performance of LSTT and MMD when the contamination rate
was increased as $\delta$ = 0.02, 0.04, 0.06, ..., 0.2. Table 2 shows the number of times LSTT or
MMD rejected the null hypothesis with a lower contamination rate $\delta$. The format 'l/t/m'
in the table means that LSTT rejected the null hypothesis with a lower contamination
rate $\delta$ than MMD l times, and vice versa m times; t denotes the number of times
the smallest $\delta$ at which LSTT and MMD rejected the null hypothesis was the same. The
significance level was set to 0.05. The results show that LSTT tended to reject the null
hypothesis at a lower contamination rate $\delta$ than MMD.

6.3  Brown  Corpus  Dataset  with  Tree  Kernels 

In  the  last  set  of  experiments,  we  compared  the  performance  of  LSTT  and  MMD  using 
natural  language  datasets. 

We used the Brown corpus dataset (see footnote 4), which is a carefully compiled selection of current
American English. The Brown corpus consists of a million words sampled from 15 genres
such as news and religion, and it is accompanied by part-of-speech tags, which represent
relationships with adjacent and related words in a phrase, sentence, or paragraph. We
converted the Brown corpus data to a dependency tree representation using the MaltParser
(see footnote 5).

We prepared two sets of dependency trees as follows: one consists of 1000 samples
taken from the 'news' category, and the other consists of 1000(1 - $\delta$) samples taken from
the 'news' category and 1000$\delta$ samples taken from the 'romance' category, where $\delta$ is the
contamination rate. The goal is to test whether the two sets of samples were drawn from
the same distribution or not for various contamination rates.

We computed the labeled ordered tree kernel (Kashima & Koyanagi, 2002) between
two dependency trees, which counts the number of sub-trees common to both trees. Then
the kernel values were directly fed into the LSTT and MMD algorithms. The labeled
ordered tree kernel contains the decay factor parameter $\gamma$ (0 < $\gamma$ < 1), which controls
the weights for large sub-trees (Collins & Duffy, 2002). We computed kernel values for
$\gamma$ = 0.1, 0.4, 0.7, and chose the one that minimized the cross-validation score in the case
of LSTT and the one that maximized the MMD value in the case of MMD.


Footnote 4: The Brown corpus dataset can be downloaded by using the Natural Language Toolkit, which
contains open source Python modules, linguistic data, and documentation for research and development
in natural language processing and text analysis. The Natural Language Toolkit is available from
http://www.nltk.org/.

Footnote 5: The MaltParser is available from http://maltparser.org/.

Table 1: The experimental results for the USPS datasets. The number of times LSTT
or MMD incorrectly rejected the null hypothesis over 10 runs when the null hypothesis
was correct (i.e., the two distributions are the same). c in the table denotes the target
class. The format 'l/m' means that LSTT and MMD rejected the null hypothesis l and
m times, respectively. The significance level was set to 0.05.

    c      0    1    2    3    4    5    6    7    8    9
    l/m   0/1  1/0  1/0  0/0  0/1  1/0  1/0  0/1  0/0  1/0

Table 2: The experimental results for the USPS datasets. The number of times LSTT or
MMD rejected the null hypothesis with a smaller contamination rate. c denotes the target
class and c' denotes the contamination class. The format 'l/t/m' means that LSTT/MMD
rejected the null hypothesis with a smaller contamination rate than MMD/LSTT l/m
times, while the smallest contamination rate at which LSTT and MMD rejected the null
hypothesis was the same t times. The significance level was set to 0.05. The numbers are
boldfaced if they are larger than or equal to 5.

We first investigated the number of times LSTT or MMD incorrectly rejected the
null hypothesis when the null hypothesis was correct (i.e., $\delta$ = 0, meaning that the two
distributions are the same). Thus, the smaller the number is, the better the performance
is. The significance level was set to 0.05. The results were that LSTT rejected the correct
null hypothesis 30 times out of 100 runs, while MMD rejected the correct null hypothesis
only 8 times. Thus MMD gave a smaller type-I error.

Next, we compared the performance of LSTT and MMD when the contamination rate
was increased as $\delta$ = 0.05, 0.1, 0.15, ..., 0.35. The significance level was set to 0.05. The
results were that LSTT rejected the null hypothesis with a lower contamination rate $\delta$
than MMD 60 times out of 100 runs, while MMD rejected the null hypothesis with a lower
contamination rate $\delta$ than LSTT only 18 times; the smallest $\delta$ at which LSTT and MMD
rejected the null hypothesis was the same 22 times. This means that LSTT tended to
reject the null hypothesis at a lower contamination rate $\delta$ than MMD.


7  Conclusions 


We proposed a novel method of testing homogeneity called the least-squares two-sample
test (LSTT). Across the experiments, LSTT tended to
produce a smaller type-II error than the state-of-the-art MMD method, with a slightly larger
type-I error.

The performance of LSTT relies on the accuracy of density ratio estimation. We
adopted unconstrained least-squares importance fitting (uLSIF; Kanamori et al., 2009a)
since it possesses the optimal non-parametric convergence rate and optimal numerical
stability (Kanamori et al., 2009b). uLSIF is computationally highly efficient thanks to the
analytic-form solution, which is an attractive feature in the computationally demanding
permutation test procedure. Nevertheless, the permutation test procedure is still time
consuming, so speedup is an important future research topic.

We have elucidated the convergence rate of our uLSIF-based Pearson divergence
estimator. We further showed that our uLSIF-based Pearson divergence estimator even
achieves a faster convergence rate when the two distributions are the same. An important
future study along this line of research is to elucidate the asymptotic distribution of the
LSTT estimator so that homogeneity testing can be carried out analytically.

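Based on the uLSIF estimator $\widehat{r}(x)$, we constructed a consistent Pearson divergence
estimator given by

\[ \widehat{\mathrm{PE}}(X, X') := \frac{1}{2n} \sum_{i=1}^{n} \widehat{r}(x_i) - \frac{1}{n'} \sum_{j=1}^{n'} \widehat{r}(x'_j) + \frac{1}{2}. \]

On the other hand, it is possible to construct different consistent estimators, e.g.,

\[ \widehat{\mathrm{PE}}'(X, X') := \frac{1}{2n} \sum_{i=1}^{n} \widehat{r}(x_i) - \frac{1}{2}, \]

\[ \widehat{\mathrm{PE}}''(X, X') := -\frac{1}{2n'} \sum_{j=1}^{n'} \widehat{r}(x'_j)^2 + \frac{1}{n} \sum_{i=1}^{n} \widehat{r}(x_i) - \frac{1}{2}. \]

$\widehat{\mathrm{PE}}'(X, X')$ would be the simplest estimator, while $\widehat{\mathrm{PE}}''(X, X')$ can be obtained as the
Legendre-Fenchel dual of the Pearson divergence (Nguyen et al., 2010). Investigating
theoretical and experimental performance of these variants in terms of accuracy and
computational efficiency is left open as future work.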
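To see why such alternative estimators can target the same quantity, the following worked identities may help. They are a sketch based on the definition of the Pearson divergence in Section 3.1 and on the normalization of p; the variational form is the standard Legendre-Fenchel argument and is stated here without the regularity conditions.

```latex
\begin{align*}
\int r(x) p'(x)\,dx &= \int p(x)\,dx = 1
  \quad\Longrightarrow\quad
  \mathrm{PE}(P, P') = \frac{1}{2} \int r(x) p(x)\,dx - \frac{1}{2}, \\
\mathrm{PE}(P, P') &= \sup_{g} \left[ -\frac{1}{2} \int g(x)^2 p'(x)\,dx + \int g(x) p(x)\,dx \right] - \frac{1}{2},
\end{align*}
```

where the supremum in the second line is attained at g = r, since the integrand is maximized pointwise at g(x) = p(x)/p'(x). Replacing the expectations by sample averages and g by the estimated ratio gives estimators of the two alternative forms.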

Recently, novel approaches to density ratio estimation for high-dimensional problems
have been explored (Sugiyama et al., 2010; Yamada et al., 2010; Sugiyama et al., 2011).
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In  our  future  work,  we  would  like  to  incorporate  these  new  ideas  into  the  framework  of 
LSTT  and  see  how  the  test  performance  can  be  improved. 


Acknowledgment 

MS  was  supported  by  SCAT,  AOARD,  and  the  JST  PRESTO  program.  TS  was  supported 
by  MEXT  Grant-in-Aid  for  Young  Scientists  (B)  22700289.  TK  was  supported  by  MEXT 
Grant-in-Aid  for  Young  Scientists  20700251.  MK  was  supported  by  the  JST  PRESTO 
program. 


A  Proof  of  Theorem  1 

In this section, we prove Theorem 1. For simplicity we consider a situation where $n = n'$. Even if $n \neq n'$, the following proof is valid for $n := \min(n, n')$.

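For an arbitrary function $f$, let

\[
P_n f := \frac{1}{n}\sum_{i=1}^{n} f(x_i), \qquad
P'_n f := \frac{1}{n}\sum_{j=1}^{n} f(x'_j), \qquad
P f := \mathbb{E}_{x \sim p}[f(x)], \qquad
P' f := \mathbb{E}_{x' \sim p'}[f(x')].
\]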

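Let $\mathcal{G}$ be a reproducing kernel Hilbert space (RKHS) corresponding to a kernel $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, and let the estimated density ratio $\hat{g}$ be defined as the minimizer of the following minimization problem:

\[
\hat{g} := \operatorname*{arg\,min}_{g \in \mathcal{G}} \left[ \frac{1}{2n}\sum_{j=1}^{n} g(x'_j)^2 - \frac{1}{n}\sum_{i=1}^{n} g(x_i) + \frac{\lambda_n}{2}\|g\|_{\mathcal{G}}^2 \right].
\]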

The estimated Pearson divergence $\widehat{\mathrm{PE}}$ is computed as

\[
\widehat{\mathrm{PE}}(X, X') = \frac{1}{2} P_n \hat{g} - P'_n \hat{g} + \frac{1}{2}.
\]


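By Mercer's theorem, the kernel $K(x, x')$ has the following spectral decomposition with respect to $p'$:

\[
K(x, x') = \sum_{k=1}^{\infty} e_k(x)\, \mu_k\, e_k(x'),
\]

where $\{e_k\}_{k=1}^{\infty}$ is an orthonormal system in $L_2(p')$, i.e., $P' e_k^2 = 1$ and $P'(e_k e_{k'}) = 0$ for $k \neq k'$. Define $\mathcal{N}(\lambda)$ as

\[
\mathcal{N}(\lambda) := \sum_{k=1}^{\infty} \frac{\mu_k}{\mu_k + \lambda}.
\]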


We assume the following conditions:

• $\sup_{x \in \mathbb{R}^d} K(x, x) \le 1$,

• The constant function 1 is contained in $\mathcal{G}$: $1 \in \mathcal{G}$,

• The true density ratio $p/p'$ is contained in $\mathcal{G}$: $p/p' = g^* \in \mathcal{G}$,

• There exists a constant $0 < \gamma < 1$ such that the spectrum $\mu_k$ of the kernel decays as $\mu_k \le c\, k^{-2/\gamma}$, where $c$ is a positive constant.

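Then we obtain the following theorem and lemma (these are more precise versions of Theorem 1).

Theorem 1'. Under the assumptions described above, for $\lambda_n = \left(\frac{\log n}{n}\right)^{\frac{2}{2+\gamma}}$, we have

\[
\left|\widehat{\mathrm{PE}}(X, X') - \mathrm{PE}(P, P')\right| = O_p\!\left( \left(\frac{\log n}{n}\right)^{\frac{2}{2+\gamma}} + \sqrt{P'(g^* - 1)^2}\, \left(\frac{\log n}{n}\right)^{\frac{1}{2+\gamma}} \right).
\]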
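To get a feel for the quantities in this statement, the sketch below evaluates $\mathcal{N}(\lambda) = \sum_k \mu_k/(\mu_k + \lambda)$ for a spectrum that attains the assumed decay exactly, $\mu_k = c\,k^{-2/\gamma}$, together with a regularization schedule of the form $(\log n / n)^{2/(2+\gamma)}$. The function names, the truncation length, the choice $c = 1$, $\gamma = 0.5$, and the specific form of the schedule are all illustrative assumptions, not values taken from the analysis.

```python
import math


def effective_dimension(lam, c=1.0, gamma=0.5, n_terms=20000):
    """N(lam) = sum_k mu_k / (mu_k + lam) for mu_k = c * k**(-2/gamma).

    The infinite sum is truncated at n_terms (an assumption of this sketch;
    the discarded tail is negligible for the values used below).
    """
    total = 0.0
    for k in range(1, n_terms + 1):
        mu = c * k ** (-2.0 / gamma)
        total += mu / (mu + lam)
    return total


def regularization(n, gamma=0.5):
    """Assumed schedule lambda_n = (log n / n) ** (2 / (2 + gamma))."""
    return (math.log(n) / n) ** (2.0 / (2.0 + gamma))


for n in (100, 1000, 10000):
    lam = regularization(n, gamma=0.5)
    ratio = effective_dimension(lam, gamma=0.5) / (n * lam)
    print(n, lam, ratio)
```

Under these assumptions the printed ratio $\mathcal{N}(\lambda_n)/(n\lambda_n)$ shrinks slowly as $n$ grows, which is the kind of behaviour a sample-size condition involving $\mathcal{N}(\lambda_n)$ relies on.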

Lemma 1. Suppose that the assumptions described above hold and

641og2(12/7)Af(An) 
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Then  we  have 
|PE(T,T')  -  PE (P,P') 
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+  \y/P'{g'~  l)2^128 log2(12/7)  ( 

with  probability  at  least  1  —  rj,  where 
Before proving the lemma, we introduce the following proposition, which is a part of Proposition 2 in Caponnetto and de Vito (2007).
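Proposition 1. Let $\xi$ be a random variable taking values in a real separable Hilbert space $\mathcal{K}$ on a probability space $(\Omega, \mathcal{F}, P)$. Assume that there are two positive constants $L$ and $\sigma$ such that

\[
\|\xi\|_{\mathcal{K}} \le \frac{L}{2} \quad \text{a.s.}, \tag{15}
\]
\[
\mathbb{E}\left[\|\xi\|_{\mathcal{K}}^2\right] \le \sigma^2. \tag{16}
\]

Then, for all $n \ge 1$ and $0 < \eta < 1$, it holds that

\[
\mathrm{Prob}\left[ \left\| \frac{1}{n}\sum_{i=1}^{n} \xi(\omega_i) - \mathbb{E}[\xi] \right\|_{\mathcal{K}} \le 2\left( \frac{L}{n} + \frac{\sigma}{\sqrt{n}} \right) \log\frac{2}{\eta} \right] \ge 1 - \eta. \tag{17}
\]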

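Proof of Lemma 1. First we define some notation. Let $K_x$ be an element of $\mathcal{G}$ such that

\[
\langle K_x, f \rangle = f(x)
\]

for $f \in \mathcal{G}$ and $x \in \mathbb{R}^d$, i.e., $K_x(\cdot) = K(x, \cdot)$ as an element of $\mathcal{G}$. We define $T_{p'} : \mathcal{G} \to \mathcal{G}$ as

\[
\langle g, T_{p'} f \rangle = \mathbb{E}_{x' \sim p'}[g(x') f(x')]
\]

for $f, g \in \mathcal{G}$. Similarly we define $\hat{T}_{p'} : \mathcal{G} \to \mathcal{G}$ as

\[
\langle g, \hat{T}_{p'} f \rangle = \frac{1}{n}\sum_{j=1}^{n} g(x'_j) f(x'_j).
\]

Note that $T_{p'} = \mathbb{E}_{x' \sim p'}[K_{x'} K_{x'}^{\circ}]$, where $K_x^{\circ}$ is the adjoint of $K_x$. Let $\phi_k := \sqrt{\mu_k}\, e_k$. Then $\{\phi_k\}_{k=1}^{\infty}$ is a complete orthonormal system in the RKHS $\mathcal{G}$, and $T_{p'}$ can be represented as

\[
T_{p'} = \sum_{k=1}^{\infty} \phi_k\, \mu_k\, \phi_k^{\circ}.
\]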

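Let $h_1, \hat{h}_1, h_2, \hat{h}_2 \in \mathcal{G}$ be

\[
h_1 := \mathbb{E}_{x' \sim p'}[K_{x'}], \qquad \hat{h}_1 := \frac{1}{n}\sum_{j=1}^{n} K_{x'_j},
\]
\[
h_2 := \mathbb{E}_{x \sim p}[K_x] = \mathbb{E}_{x' \sim p'}[K_{x'} g^*(x')] = \mathbb{E}_{x' \sim p'}[K_{x'} \langle K_{x'}, g^* \rangle_{\mathcal{G}}] = T_{p'} g^*, \qquad \hat{h}_2 := \frac{1}{n}\sum_{i=1}^{n} K_{x_i}.
\]

Note that $\mathbb{E}[\hat{h}_1] = h_1$ and $\mathbb{E}[\hat{h}_2] = h_2$, and

\[
\langle h_1, f \rangle = P' f, \quad \langle \hat{h}_1, f \rangle = P'_n f, \quad \langle h_2, f \rangle = P f, \quad \langle \hat{h}_2, f \rangle = P_n f. \tag{18}
\]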

It can be easily checked that

\[
\hat{g} = (\hat{T}_{p'} + \lambda_n)^{-1} \hat{h}_2.
\]

Here we define

\[
g_{\lambda_n} := (T_{p'} + \lambda_n)^{-1} h_2.
\]

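The difference between $\widehat{\mathrm{PE}}(X, X')$ and $\mathrm{PE}(P, P')$ is expanded as

\[
\begin{aligned}
\widehat{\mathrm{PE}}(X, X') - \mathrm{PE}(P, P')
&= \tfrac{1}{2}(P_n \hat{g} - P g^*) - (P'_n \hat{g} - P' g^*) \\
&= \tfrac{1}{2}\left[ (P_n - P)(\hat{g} - g^*) + P(\hat{g} - g^*) + (P_n - P) g^* \right] - (P'_n \hat{g} - 1).
\end{aligned} \tag{19}
\]

The term $P(\hat{g} - g^*)$ is bounded as

\[
\begin{aligned}
|P(\hat{g} - g^*)|
&\le |P'(\hat{g} - g^*)| + |P'((g^* - 1)(\hat{g} - g^*))| \\
&\le |P'(\hat{g} - g^*)| + \sqrt{P'(g^* - 1)^2}\,\sqrt{P'(\hat{g} - g^*)^2} \\
&= |(P' - P'_n)(\hat{g} - g^*) + P'_n \hat{g} - P' g^* + (P' - P'_n) g^*| + \sqrt{P'(g^* - 1)^2}\,\sqrt{P'(\hat{g} - g^*)^2} \\
&\le |(P' - P'_n)(\hat{g} - g^*)| + |P'_n \hat{g} - 1| + |(P' - P'_n) g^*| + \sqrt{P'(g^* - 1)^2}\,\sqrt{P'(\hat{g} - g^*)^2}.
\end{aligned} \tag{20}
\]

Therefore, Eq.(19) indicates

\[
\begin{aligned}
\left|\widehat{\mathrm{PE}}(X, X') - \mathrm{PE}(P, P')\right|
\le{}& \tfrac{1}{2}|(P'_n - P')(\hat{g} - g^*)| + \tfrac{1}{2}|(P_n - P)(\hat{g} - g^*)| + \tfrac{3}{2}|P'_n \hat{g} - 1| \\
&+ \tfrac{1}{2}|(P'_n - P') g^*| + \tfrac{1}{2}|(P_n - P) g^*| + \tfrac{1}{2}\sqrt{P'(g^* - 1)^2}\,\sqrt{P'(\hat{g} - g^*)^2}.
\end{aligned} \tag{21}
\]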

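Step 1. Bounding $(P'_n - P')(\hat{g} - g^*)$

\[
\begin{aligned}
(P'_n - P')(\hat{g} - g^*)
&= \langle \hat{h}_1 - h_1,\; (\hat{T}_{p'} + \lambda_n)^{-1}\hat{h}_2 - g^* \rangle \\
&= \langle \hat{h}_1 - h_1,\; (\hat{T}_{p'} + \lambda_n)^{-1}(\hat{h}_2 - h_2) + (\hat{T}_{p'} + \lambda_n)^{-1} h_2 - (T_{p'} + \lambda_n)^{-1} h_2 + (T_{p'} + \lambda_n)^{-1} h_2 - g^* \rangle \\
&= \langle \hat{h}_1 - h_1,\; (\hat{T}_{p'} + \lambda_n)^{-1}(\hat{h}_2 - h_2) \rangle + \langle \hat{h}_1 - h_1,\; (\hat{T}_{p'} + \lambda_n)^{-1}(T_{p'} - \hat{T}_{p'})(T_{p'} + \lambda_n)^{-1} h_2 \rangle \\
&\quad + \langle \hat{h}_1 - h_1,\; (T_{p'} + \lambda_n)^{-1} h_2 - g^* \rangle \\
&= \underbrace{\langle \hat{h}_1 - h_1,\; (\hat{T}_{p'} + \lambda_n)^{-1}(\hat{h}_2 - h_2) \rangle}_{\text{(1-a)}} + \underbrace{\langle \hat{h}_1 - h_1,\; (\hat{T}_{p'} + \lambda_n)^{-1}(T_{p'} - \hat{T}_{p'})(T_{p'} + \lambda_n)^{-1} h_2 \rangle}_{\text{(1-b)}} \\
&\quad - \underbrace{\langle \hat{h}_1 - h_1,\; (T_{p'} + \lambda_n)^{-1} \lambda_n g^* \rangle}_{\text{(1-c)}},
\end{aligned} \tag{22}
\]

where in the last equality we used the relation $T_{p'} g^* = h_2$.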

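Let $\|\cdot\|_{\mathcal{L}(\mathcal{G})}$ be the operator norm of a bounded linear operator from $\mathcal{G}$ to $\mathcal{G}$. Then

\[
\begin{aligned}
\left\|(T_{p'} + \lambda_n)^{\frac12}(\hat{T}_{p'} + \lambda_n)^{-1}(T_{p'} + \lambda_n)^{\frac12}\right\|_{\mathcal{L}(\mathcal{G})}
&= \left\|\left[(T_{p'} + \lambda_n)^{-\frac12}(\hat{T}_{p'} + \lambda_n - T_{p'} - \lambda_n + T_{p'} + \lambda_n)(T_{p'} + \lambda_n)^{-\frac12}\right]^{-1}\right\|_{\mathcal{L}(\mathcal{G})} \\
&= \left\|\left[(T_{p'} + \lambda_n)^{-\frac12}(\hat{T}_{p'} - T_{p'})(T_{p'} + \lambda_n)^{-\frac12} + I\right]^{-1}\right\|_{\mathcal{L}(\mathcal{G})}.
\end{aligned} \tag{23}
\]

We define $\mathcal{A}_1$ as follows:

\[
\mathcal{A}_1 := \left\{ \left\|(\hat{T}_{p'} - T_{p'})(T_{p'} + \lambda_n)^{-1}\right\|_{\mathcal{L}(\mathcal{G})} \le \frac{1}{2} \right\}.
\]

Caponnetto and de Vito (2007) showed that under the event $\mathcal{A}_1$,

\[
\left\|(T_{p'} + \lambda_n)^{-\frac12}(\hat{T}_{p'} - T_{p'})(T_{p'} + \lambda_n)^{-\frac12}\right\|_{\mathcal{L}(\mathcal{G})} \le \frac{1}{2},
\]

and the probability of $\mathcal{A}_1$ is at least $1 - \eta/6$ under the condition Eq.(13). Therefore we obtain

\[
\left\|(T_{p'} + \lambda_n)^{\frac12}(\hat{T}_{p'} + \lambda_n)^{-1}(T_{p'} + \lambda_n)^{\frac12}\right\|_{\mathcal{L}(\mathcal{G})}
= \left\|\left[(T_{p'} + \lambda_n)^{-\frac12}(\hat{T}_{p'} - T_{p'})(T_{p'} + \lambda_n)^{-\frac12} + I\right]^{-1}\right\|_{\mathcal{L}(\mathcal{G})} \le 2 \tag{24}
\]

on the event $\mathcal{A}_1$.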

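Bounding (1-a):

\[
\begin{aligned}
&\langle \hat{h}_1 - h_1,\; (\hat{T}_{p'} + \lambda_n)^{-1}(\hat{h}_2 - h_2) \rangle \\
&\quad = \langle \hat{h}_1 - h_1,\; (T_{p'} + \lambda_n)^{-\frac12}\left[(T_{p'} + \lambda_n)^{\frac12}(\hat{T}_{p'} + \lambda_n)^{-1}(T_{p'} + \lambda_n)^{\frac12}\right](T_{p'} + \lambda_n)^{-\frac12}(\hat{h}_2 - h_2) \rangle \\
&\quad \le \left\|(T_{p'} + \lambda_n)^{-\frac12}(\hat{h}_1 - h_1)\right\|_{\mathcal{G}} \left\|(T_{p'} + \lambda_n)^{\frac12}(\hat{T}_{p'} + \lambda_n)^{-1}(T_{p'} + \lambda_n)^{\frac12}\right\|_{\mathcal{L}(\mathcal{G})} \left\|(T_{p'} + \lambda_n)^{-\frac12}(\hat{h}_2 - h_2)\right\|_{\mathcal{G}}.
\end{aligned}
\]

According to Eq.(24), we have

\[
\left\|(T_{p'} + \lambda_n)^{\frac12}(\hat{T}_{p'} + \lambda_n)^{-1}(T_{p'} + \lambda_n)^{\frac12}\right\|_{\mathcal{L}(\mathcal{G})} \le 2
\]

on the event $\mathcal{A}_1$.

Let $\xi : \mathbb{R}^d \to \mathcal{G}$ be the random variable

\[
\xi(x') = (T_{p'} + \lambda_n)^{-\frac12} K_{x'}.
\]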
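Then

\[
(T_{p'} + \lambda_n)^{-\frac12}(\hat{h}_1 - h_1) = (P'_n - P')\xi,
\]
\[
\|\xi\|_{\mathcal{G}} = \sqrt{K_{x'}^{\circ}(T_{p'} + \lambda_n)^{-1} K_{x'}} \le \frac{1}{\sqrt{\lambda_n}},
\]
\[
\mathbb{E}_{x' \sim p'}\left[\|\xi\|_{\mathcal{G}}^2\right] = \mathbb{E}_{x' \sim p'}\left[K_{x'}^{\circ}(T_{p'} + \lambda_n)^{-1} K_{x'}\right] = \mathbb{E}_{x' \sim p'}\left[\sum_{k=1}^{\infty} \frac{\mu_k}{\mu_k + \lambda_n} e_k(x')^2\right] = \sum_{k=1}^{\infty} \frac{\mu_k}{\mu_k + \lambda_n} = \mathcal{N}(\lambda_n).
\]

Therefore, by Proposition 1, we have

\[
\left\|(T_{p'} + \lambda_n)^{-\frac12}(\hat{h}_1 - h_1)\right\|_{\mathcal{G}} = \left\|(P'_n - P')\xi\right\|_{\mathcal{G}} \le 2\log(12/\eta)\left( \frac{2}{n\sqrt{\lambda_n}} + \sqrt{\frac{\mathcal{N}(\lambda_n)}{n}} \right) \tag{25}
\]

with probability $1 - \eta/6$. We define $\mathcal{A}_2$ as the event where the above inequality holds:

\[
\mathcal{A}_2 := \left\{ \left\|(T_{p'} + \lambda_n)^{-\frac12}(\hat{h}_1 - h_1)\right\|_{\mathcal{G}} \le 2\log(12/\eta)\left( \frac{2}{n\sqrt{\lambda_n}} + \sqrt{\frac{\mathcal{N}(\lambda_n)}{n}} \right) \right\}. \tag{26}
\]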


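One can obtain a similar bound for $\|(T_{p'} + \lambda_n)^{-\frac12}(\hat{h}_2 - h_2)\|_{\mathcal{G}}$. In fact, using

\[
\begin{aligned}
\mathbb{E}_{x \sim p}\left[\sum_{k=1}^{\infty} \frac{\mu_k}{\mu_k + \lambda_n} e_k(x)^2\right]
&= \mathbb{E}_{x' \sim p'}\left[g^*(x') \sum_{k=1}^{\infty} \frac{\mu_k}{\mu_k + \lambda_n} e_k(x')^2\right] \\
&\le \|g^*\|_{\mathcal{G}}\, \mathbb{E}_{x' \sim p'}\left[\sum_{k=1}^{\infty} \frac{\mu_k}{\mu_k + \lambda_n} e_k(x')^2\right] = \|g^*\|_{\mathcal{G}}\, \mathcal{N}(\lambda_n),
\end{aligned} \tag{27}
\]

instead of Eq.(25), one can show that, by Proposition 1,

\[
\left\|(T_{p'} + \lambda_n)^{-\frac12}(\hat{h}_2 - h_2)\right\|_{\mathcal{G}} \le 2\log(12/\eta)\left( \frac{2}{n\sqrt{\lambda_n}} + \sqrt{\frac{\|g^*\|_{\mathcal{G}}\, \mathcal{N}(\lambda_n)}{n}} \right) \tag{28}
\]

with probability $1 - \eta/6$. We define $\mathcal{A}_3$ as the event where the above inequality holds:

\[
\mathcal{A}_3 := \left\{ \left\|(T_{p'} + \lambda_n)^{-\frac12}(\hat{h}_2 - h_2)\right\|_{\mathcal{G}} \le 2\log(12/\eta)\left( \frac{2}{n\sqrt{\lambda_n}} + \sqrt{\frac{\|g^*\|_{\mathcal{G}}\, \mathcal{N}(\lambda_n)}{n}} \right) \right\}. \tag{29}
\]

Combining Eqs.(24), (25), and (28), we can show that the term (1-a) is bounded as

\[
\left|\langle \hat{h}_1 - h_1,\; (\hat{T}_{p'} + \lambda_n)^{-1}(\hat{h}_2 - h_2) \rangle\right| \le 16\log(12/\eta)^2\left( \frac{4}{n^2\lambda_n} + \frac{\sqrt{\|g^*\|_{\mathcal{G}}}\, \mathcal{N}(\lambda_n)}{n} \right) \tag{30}
\]

on the event $\mathcal{A}_1 \cap \mathcal{A}_2 \cap \mathcal{A}_3$.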
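Bounding (1-b):

\[
\begin{aligned}
&\langle \hat{h}_1 - h_1,\; (\hat{T}_{p'} + \lambda_n)^{-1}(T_{p'} - \hat{T}_{p'})(T_{p'} + \lambda_n)^{-1} h_2 \rangle \\
&\quad \le \left\|(T_{p'} + \lambda_n)^{-\frac12}(\hat{h}_1 - h_1)\right\|_{\mathcal{G}} \left\|(T_{p'} + \lambda_n)^{\frac12}(\hat{T}_{p'} + \lambda_n)^{-1}(T_{p'} + \lambda_n)^{\frac12}\right\|_{\mathcal{L}(\mathcal{G})} \\
&\qquad \times \left\|(T_{p'} + \lambda_n)^{-\frac12}(T_{p'} - \hat{T}_{p'})(T_{p'} + \lambda_n)^{-1} h_2\right\|_{\mathcal{G}}.
\end{aligned}
\]

We have already obtained bounds for $\|(T_{p'} + \lambda_n)^{-\frac12}(\hat{h}_1 - h_1)\|_{\mathcal{G}}$ and $\|(T_{p'} + \lambda_n)^{\frac12}(\hat{T}_{p'} + \lambda_n)^{-1}(T_{p'} + \lambda_n)^{\frac12}\|_{\mathcal{L}(\mathcal{G})}$ in Eq.(25) and Eq.(24):

\[
\left\|(T_{p'} + \lambda_n)^{-\frac12}(\hat{h}_1 - h_1)\right\|_{\mathcal{G}} \le 2\log(12/\eta)\left( \frac{2}{n\sqrt{\lambda_n}} + \sqrt{\frac{\mathcal{N}(\lambda_n)}{n}} \right), \tag{31}
\]
\[
\left\|(T_{p'} + \lambda_n)^{\frac12}(\hat{T}_{p'} + \lambda_n)^{-1}(T_{p'} + \lambda_n)^{\frac12}\right\|_{\mathcal{L}(\mathcal{G})} \le 2, \tag{32}
\]

on the event $\mathcal{A}_1 \cap \mathcal{A}_2$.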

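Let $\xi : \mathbb{R}^d \to \mathcal{G}$ be the random variable such that

\[
\xi(x) = (T_{p'} + \lambda_n)^{-\frac12} K_x K_x^{\circ} (T_{p'} + \lambda_n)^{-1} h_2.
\]

Then we have

\[
\begin{aligned}
\|\xi(x)\|_{\mathcal{G}}
&= \left\|(T_{p'} + \lambda_n)^{-\frac12} K_x K_x^{\circ} (T_{p'} + \lambda_n)^{-1} T_{p'} g^*\right\|_{\mathcal{G}} \\
&\le \left\|(T_{p'} + \lambda_n)^{-\frac12}\right\|_{\mathcal{L}(\mathcal{G})} \left\|K_x K_x^{\circ}\right\|_{\mathcal{L}(\mathcal{G})} \left\|(T_{p'} + \lambda_n)^{-1} T_{p'}\right\|_{\mathcal{L}(\mathcal{G})} \|g^*\|_{\mathcal{G}} \\
&\le \lambda_n^{-\frac12}\, \|g^*\|_{\mathcal{G}},
\end{aligned}
\]

where we used the relation

\[
\|(K_x K_x^{\circ}) h\|_{\mathcal{G}} = \|K_x \langle K_x, h \rangle_{\mathcal{G}}\|_{\mathcal{G}} = |\langle K_x, h \rangle_{\mathcal{G}}|\, \|K_x\|_{\mathcal{G}} \le \|h\|_{\mathcal{G}} \|K_x\|_{\mathcal{G}}^2 = \|h\|_{\mathcal{G}}\, K(x, x) \le \|h\|_{\mathcal{G}}
\]

for all $h \in \mathcal{G}$. Then,

\[
\begin{aligned}
\mathbb{E}_{x' \sim p'}\left[\|\xi(x')\|_{\mathcal{G}}^2\right]
&= \mathbb{E}_{x' \sim p'}\left[\left\|(T_{p'} + \lambda_n)^{-\frac12} K_{x'} K_{x'}^{\circ} (T_{p'} + \lambda_n)^{-1} T_{p'} g^*\right\|_{\mathcal{G}}^2\right] \\
&\le \mathbb{E}_{x' \sim p'}\left[\left\|(T_{p'} + \lambda_n)^{-1} K_{x'} K_{x'}^{\circ}\right\|_{\mathcal{L}(\mathcal{G})} \left\|K_{x'}^{\circ} K_{x'}\right\|_{\mathcal{L}(\mathcal{G})}\right] \left\|(T_{p'} + \lambda_n)^{-1} T_{p'} g^*\right\|_{\mathcal{G}}^2 \\
&\le \mathbb{E}_{x' \sim p'}\left[\mathrm{tr}\left((T_{p'} + \lambda_n)^{-1} K_{x'} K_{x'}^{\circ}\right)\right] \|g^*\|_{\mathcal{G}}^2 \\
&= \mathrm{tr}\left((T_{p'} + \lambda_n)^{-1} T_{p'}\right) \|g^*\|_{\mathcal{G}}^2 \\
&= \mathcal{N}(\lambda_n)\, \|g^*\|_{\mathcal{G}}^2.
\end{aligned}
\]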


Therefore,  by  Proposition  1,  we  obtain 

||(Tp/  +  An)  2 (Tpr  —  Tpi) (Tpi  +  Xn)  1/i2||g 


<  21og(12/77) 


ny/Yn 


+ 


(33) 


n 
with probability $1 - \eta/6$. We define $\mathcal{A}_4$ as the event where the above inequality holds:


A4  :  — <  || (Tpi  +  Xn)  2{Tpt  —  Tpi)(Tpi  +  An)  h2\\g 


<2iog(i2/7?)  1 


n 


\/  Xn 


n 


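Combining Eqs.(25), (24), and (33), the term (1-b) is bounded as

\[
\left|\langle \hat{h}_1 - h_1,\; (\hat{T}_{p'} + \lambda_n)^{-1}(T_{p'} - \hat{T}_{p'})(T_{p'} + \lambda_n)^{-1} h_2 \rangle\right| \le 16\log(12/\eta)^2\left( \frac{4}{n^2\lambda_n} + \frac{\|g^*\|_{\mathcal{G}}\, \mathcal{N}(\lambda_n)}{n} \right)
\]

on the event $\mathcal{A}_1 \cap \mathcal{A}_2 \cap \mathcal{A}_4$.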

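Bounding (1-c): We have

\[
\begin{aligned}
\langle \hat{h}_1 - h_1,\; (T_{p'} + \lambda_n)^{-1} \lambda_n g^* \rangle
&= \langle (T_{p'} + \lambda_n)^{-\frac12}(\hat{h}_1 - h_1),\; (T_{p'} + \lambda_n)^{-\frac12} \lambda_n g^* \rangle \\
&\le \left\|(T_{p'} + \lambda_n)^{-\frac12}(\hat{h}_1 - h_1)\right\|_{\mathcal{G}} \left\|(T_{p'} + \lambda_n)^{-\frac12} \sqrt{\lambda_n}\, g^*\right\|_{\mathcal{G}} \sqrt{\lambda_n}.
\end{aligned} \tag{34}
\]

Notice that Eq.(25) gives

\[
\left\|(T_{p'} + \lambda_n)^{-\frac12}(\hat{h}_1 - h_1)\right\|_{\mathcal{G}} \le 2\log(12/\eta)\left( \frac{2}{n\sqrt{\lambda_n}} + \sqrt{\frac{\mathcal{N}(\lambda_n)}{n}} \right) \tag{35}
\]

on the event $\mathcal{A}_2$. This and $\|(T_{p'} + \lambda_n)^{-\frac12} \sqrt{\lambda_n}\, g^*\|_{\mathcal{G}} \le \|g^*\|_{\mathcal{G}}$ give

\[
\left|\langle \hat{h}_1 - h_1,\; (T_{p'} + \lambda_n)^{-1} \lambda_n g^* \rangle\right| \le 2\log(12/\eta)\left( \frac{2\|g^*\|_{\mathcal{G}}}{n} + \|g^*\|_{\mathcal{G}} \sqrt{\frac{\mathcal{N}(\lambda_n)\lambda_n}{n}} \right) \tag{36}
\]

on the event $\mathcal{A}_2$.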
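Combining the bounds of (1-a), (1-b), and (1-c):

\[
\begin{aligned}
\left|(P'_n - P')(\hat{g} - g^*)\right|
&\le 16\log(12/\eta)^2\left( \frac{4}{n^2\lambda_n} + \frac{\sqrt{\|g^*\|_{\mathcal{G}}}\, \mathcal{N}(\lambda_n)}{n} + \frac{4}{n^2\lambda_n} + \frac{\|g^*\|_{\mathcal{G}}\, \mathcal{N}(\lambda_n)}{n} \right) \\
&\quad + 2\log(12/\eta)\left( \frac{2\|g^*\|_{\mathcal{G}}}{n} + \|g^*\|_{\mathcal{G}} \sqrt{\frac{\mathcal{N}(\lambda_n)\lambda_n}{n}} \right) \\
&= 16\log(12/\eta)^2\left( \frac{8}{n^2\lambda_n} + \frac{\left(\|g^*\|_{\mathcal{G}} + \sqrt{\|g^*\|_{\mathcal{G}}}\right) \mathcal{N}(\lambda_n)}{n} \right) \\
&\quad + 2\log(12/\eta)\left( \frac{2\|g^*\|_{\mathcal{G}}}{n} + \|g^*\|_{\mathcal{G}} \sqrt{\frac{\mathcal{N}(\lambda_n)\lambda_n}{n}} \right)
\end{aligned}
\]

on the event $\mathcal{A}_1 \cap \mathcal{A}_2 \cap \mathcal{A}_3 \cap \mathcal{A}_4$.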

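Step 2. Bounding $|(P_n - P)(\hat{g} - g^*)|$

As in Eq.(22), we have

\[
\begin{aligned}
(P_n - P)(\hat{g} - g^*)
={}& \langle \hat{h}_2 - h_2,\; (\hat{T}_{p'} + \lambda_n)^{-1}(\hat{h}_2 - h_2) \rangle + \langle \hat{h}_2 - h_2,\; (\hat{T}_{p'} + \lambda_n)^{-1}(T_{p'} - \hat{T}_{p'})(T_{p'} + \lambda_n)^{-1} h_2 \rangle \\
&- \langle \hat{h}_2 - h_2,\; (T_{p'} + \lambda_n)^{-1} \lambda_n g^* \rangle.
\end{aligned}
\]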

Using  Eq.(28)  instead  of  Eq.(25),  on  the  event  A\  fl  A3  D  A4,  each  term  is  bounded  as 
\(h2  —  h2,  (Tpi  +  A„)  l(h2  —  h2))\ 


1(^2  —  h2,  (Tpi  +  Xn)  1(TP,  —  Tpi) (Tpi  +  An)  ^h2) | 
<  16  log(12/?7)s 


4  |  Ilglll^An)^ 


1(^2  h2,  (Tpi  -f-  An)  Ag*)| 


n2Xn  n 

-1  \  „* 


’ 


<21og(12/7?) 


2||s*||e  ,  11  *nl .  /A/’(An)A 


n 


+  \\9*U 


n 
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Then  we  obtain  the  following  bound: 

\(Pn-  p)(g  -  g*)\ 

<  161og(12 /rj) 


21  8  ,  (\\gTg  +  h*\\sW^r 


n2  Xr 


+ 


n 


21og(12 /,)  I  2^k  +  \\g-\\l-<M(K) 


n 


n 


on  the  event  A\  D  Az  fl  A4. 

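Step 3. Bounding $|P'_n \hat{g} - 1|$

We decompose $\hat{g}$ as

\[
\hat{g} = u + \beta, \tag{37}
\]

where $u \perp 1$ in $\mathcal{G}$ and $\beta$ is a constant function. Then one can easily show that

\[
\beta = \frac{1 - P'_n u}{1 + \lambda_n \|1\|_{\mathcal{G}}^2}.
\]

Therefore

\[
P'_n \hat{g} - 1 = P'_n u + \beta - 1 = \frac{\lambda_n \|1\|_{\mathcal{G}}^2\, (P'_n u - 1)}{1 + \lambda_n \|1\|_{\mathcal{G}}^2}. \tag{38}
\]

If we can show that $u$ is bounded (i.e., $O_p(1)$), then $P'_n \hat{g} - 1 = O_p(\lambda_n)$. To show that, we bound $\|\hat{g}\|_{\mathcal{G}}$ because

\[
\|u\|_{\infty} \le \|u\|_{\mathcal{G}} \le \sqrt{\|u\|_{\mathcal{G}}^2 + \|\beta\|_{\mathcal{G}}^2} = \|\hat{g}\|_{\mathcal{G}}.
\]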


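We have

\[
\begin{aligned}
\|\hat{g}\|_{\mathcal{G}}^2
&= \langle \hat{h}_2,\; (\hat{T}_{p'} + \lambda_n)^{-2} \hat{h}_2 \rangle \\
&= \langle (T_{p'} + \lambda_n)^{-1} \hat{h}_2,\; \left[(T_{p'} + \lambda_n)(\hat{T}_{p'} + \lambda_n)^{-2}(T_{p'} + \lambda_n)\right](T_{p'} + \lambda_n)^{-1} \hat{h}_2 \rangle.
\end{aligned}
\]

Here

\[
(T_{p'} + \lambda_n)(\hat{T}_{p'} + \lambda_n)^{-1} = \left(I - (T_{p'} - \hat{T}_{p'})(T_{p'} + \lambda_n)^{-1}\right)^{-1},
\]

and on the event $\mathcal{A}_1$ with the condition Eq.(13), we have

\[
\left\|(T_{p'} - \hat{T}_{p'})(T_{p'} + \lambda_n)^{-1}\right\|_{\mathcal{L}(\mathcal{G})} \le \frac{1}{2}. \tag{39}
\]

Hence

\[
\left\|(T_{p'} + \lambda_n)(\hat{T}_{p'} + \lambda_n)^{-1}\right\|_{\mathcal{L}(\mathcal{G})} \le 2
\]

on the event $\mathcal{A}_1$ with the condition Eq.(13).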

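We have that

\[
\left\|(T_{p'} + \lambda_n)^{-1}(\hat{h}_2 - h_2)\right\|_{\mathcal{G}} \le \lambda_n^{-\frac12} \left\|(T_{p'} + \lambda_n)^{-\frac12}(\hat{h}_2 - h_2)\right\|_{\mathcal{G}} \le 2\log(12/\eta)\left( \frac{2}{\lambda_n n} + \sqrt{\frac{\|g^*\|_{\mathcal{G}}\, \mathcal{N}(\lambda_n)}{\lambda_n n}} \right) \tag{40}
\]

on the event $\mathcal{A}_3$. Hence Eq.(40) and

\[
\left\|(T_{p'} + \lambda_n)^{-1} h_2\right\|_{\mathcal{G}} = \left\|(T_{p'} + \lambda_n)^{-1} T_{p'} g^*\right\|_{\mathcal{G}} \le \|g^*\|_{\mathcal{G}} \tag{41}
\]

give

\[
\|\hat{g}\|_{\mathcal{G}}^2 \le 8\left( \|g^*\|_{\mathcal{G}}^2 + 8\log^2(12/\eta)\left( \frac{4}{\lambda_n^2 n^2} + \frac{\mathcal{N}(\lambda_n)\|g^*\|_{\mathcal{G}}}{\lambda_n n} \right) \right), \tag{42}
\]

on the event $\mathcal{A}_3$.

Therefore, Eqs.(38) and (42) give

\[
\left|P'_n \hat{g} - 1\right| \le \frac{\lambda_n \|1\|_{\mathcal{G}}^2}{1 + \lambda_n \|1\|_{\mathcal{G}}^2} \left( \sqrt{8\left( \|g^*\|_{\mathcal{G}}^2 + 8\log^2(12/\eta)\left( \frac{4}{\lambda_n^2 n^2} + \frac{\mathcal{N}(\lambda_n)\|g^*\|_{\mathcal{G}}}{\lambda_n n} \right) \right)} + 1 \right) \tag{43}
\]

on the event $\mathcal{A}_3$.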


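Step 4. Bounding $P'(\hat{g} - g^*)^2$

Decompose $\hat{g} - g^*$ as

\[
\hat{g} - g^* = (\hat{g} - g_{\lambda_n}) + (g_{\lambda_n} - g^*).
\]

The first term is evaluated as follows:

\[
\begin{aligned}
\hat{g} - g_{\lambda_n}
&= (\hat{T}_{p'} + \lambda_n)^{-1} \hat{h}_2 - (T_{p'} + \lambda_n)^{-1} h_2 \\
&= (\hat{T}_{p'} + \lambda_n)^{-1} \left\{ (\hat{h}_2 - h_2) + (T_{p'} - \hat{T}_{p'})(T_{p'} + \lambda_n)^{-1} h_2 \right\}.
\end{aligned} \tag{44}
\]

Thus

\[
\begin{aligned}
P'(\hat{g} - g_{\lambda_n})^2
&= \left\| \sqrt{T_{p'}}\, (\hat{T}_{p'} + \lambda_n)^{-1} \left\{ (\hat{h}_2 - h_2) + (T_{p'} - \hat{T}_{p'})(T_{p'} + \lambda_n)^{-1} h_2 \right\} \right\|_{\mathcal{G}}^2 \\
&\le 2\Big( \underbrace{\left\| \sqrt{T_{p'}}\, (\hat{T}_{p'} + \lambda_n)^{-1} (\hat{h}_2 - h_2) \right\|_{\mathcal{G}}^2}_{\text{(4-a)}} + \underbrace{\left\| \sqrt{T_{p'}}\, (\hat{T}_{p'} + \lambda_n)^{-1} (T_{p'} - \hat{T}_{p'})(T_{p'} + \lambda_n)^{-1} h_2 \right\|_{\mathcal{G}}^2}_{\text{(4-b)}} \Big).
\end{aligned} \tag{45}
\]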

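Bounding (4-a): We have

\[
\begin{aligned}
\left\| \sqrt{T_{p'}}\, (\hat{T}_{p'} + \lambda_n)^{-1} (\hat{h}_2 - h_2) \right\|_{\mathcal{G}}
\le{}& \left\| \sqrt{T_{p'}}\, (T_{p'} + \lambda_n)^{-\frac12} \right\|_{\mathcal{L}(\mathcal{G})} \left\| (T_{p'} + \lambda_n)^{\frac12} (\hat{T}_{p'} + \lambda_n)^{-1} (T_{p'} + \lambda_n)^{\frac12} \right\|_{\mathcal{L}(\mathcal{G})} \\
&\times \left\| (T_{p'} + \lambda_n)^{-\frac12} (\hat{h}_2 - h_2) \right\|_{\mathcal{G}}.
\end{aligned}
\]

It is obvious that

\[
\left\| \sqrt{T_{p'}}\, (T_{p'} + \lambda_n)^{-\frac12} \right\|_{\mathcal{L}(\mathcal{G})} \le 1. \tag{46}
\]

By Eq.(24),

\[
\left\| (T_{p'} + \lambda_n)^{\frac12} (\hat{T}_{p'} + \lambda_n)^{-1} (T_{p'} + \lambda_n)^{\frac12} \right\|_{\mathcal{L}(\mathcal{G})} \le 2 \tag{47}
\]

on the event $\mathcal{A}_1$. By Eq.(28),

\[
\left\| (T_{p'} + \lambda_n)^{-\frac12} (\hat{h}_2 - h_2) \right\|_{\mathcal{G}} \le 2\log(12/\eta)\left( \frac{2}{n\sqrt{\lambda_n}} + \sqrt{\frac{\|g^*\|_{\mathcal{G}}\, \mathcal{N}(\lambda_n)}{n}} \right) \tag{48}
\]

on the event $\mathcal{A}_3$.

Combining Eqs.(46), (24), and (28), we have

\[
\left\| \sqrt{T_{p'}}\, (\hat{T}_{p'} + \lambda_n)^{-1} (\hat{h}_2 - h_2) \right\|_{\mathcal{G}} \le 4\log(12/\eta)\left( \frac{2}{n\sqrt{\lambda_n}} + \sqrt{\frac{\|g^*\|_{\mathcal{G}}\, \mathcal{N}(\lambda_n)}{n}} \right), \tag{49}
\]

on the event $\mathcal{A}_1 \cap \mathcal{A}_3$.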


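Bounding (4-b): We have

\[
\begin{aligned}
\left\| \sqrt{T_{p'}}\, (\hat{T}_{p'} + \lambda_n)^{-1} (T_{p'} - \hat{T}_{p'})(T_{p'} + \lambda_n)^{-1} h_2 \right\|_{\mathcal{G}}
\le{}& \left\| \sqrt{T_{p'}}\, (T_{p'} + \lambda_n)^{-\frac12} \right\|_{\mathcal{L}(\mathcal{G})} \left\| (T_{p'} + \lambda_n)^{\frac12} (\hat{T}_{p'} + \lambda_n)^{-1} (T_{p'} + \lambda_n)^{\frac12} \right\|_{\mathcal{L}(\mathcal{G})} \\
&\times \left\| (T_{p'} + \lambda_n)^{-\frac12} (T_{p'} - \hat{T}_{p'})(T_{p'} + \lambda_n)^{-1} h_2 \right\|_{\mathcal{G}}.
\end{aligned}
\]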
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By  Eq.(33),  we  have 


||(Tp/  +  An)  2  (Tpt  —  Tp/)(Tp/  +  An)  h2\\g  <  21og(12/^) 


P  A  n 


+ 


n 


\\9*\\2gM(K 


n 


on  the  event  A4.  Thus  Eqs.(46),  (24),  and  (33)  indicate 

W\/Tp{Tpr  +  Xn)  1(Tpr  —  Tpi){Tpr  +  An)  1h2||g 
<  41og(12/??) 


n 


h*\\gN{^r 


n 


(50) 


on  the  event  A\  D  A4. 


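Combining the bounds of (4-a) and (4-b): Substituting Eqs.(50) and (49) into Eq.(45), we have

\[
P'(\hat{g} - g_{\lambda_n})^2 \le 64\left\{ \log^2(12/\eta)\left( \frac{4}{n^2\lambda_n} + \frac{\|g^*\|_{\mathcal{G}}\, \mathcal{N}(\lambda_n)}{n} \right) + \log^2(12/\eta)\left( \frac{4}{n^2\lambda_n} + \frac{\|g^*\|_{\mathcal{G}}^2\, \mathcal{N}(\lambda_n)}{n} \right) \right\} \tag{51}
\]

on the event $\mathcal{A}_1 \cap \mathcal{A}_4$.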

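On the other hand, $P'(g_{\lambda_n} - g^*)^2$ is bounded as

\[
\begin{aligned}
P'(g_{\lambda_n} - g^*)^2
&= \left\| \sqrt{T_{p'}}\left( (T_{p'} + \lambda_n)^{-1} h_2 - g^* \right) \right\|_{\mathcal{G}}^2
= \left\| \sqrt{T_{p'}}\, (T_{p'} + \lambda_n)^{-1} \left( h_2 - (T_{p'} + \lambda_n) g^* \right) \right\|_{\mathcal{G}}^2 \\
&= \left\| \sqrt{T_{p'}}\, (T_{p'} + \lambda_n)^{-1} \lambda_n g^* \right\|_{\mathcal{G}}^2
\le \left\| (T_{p'} + \lambda_n)^{-\frac12} \lambda_n g^* \right\|_{\mathcal{G}}^2
\le \lambda_n \|g^*\|_{\mathcal{G}}^2.
\end{aligned} \tag{52}
\]

By Eqs.(51) and (52), $P'(\hat{g} - g^*)^2$ is bounded as

\[
\begin{aligned}
P'(\hat{g} - g^*)^2
&\le 2\left( P'(\hat{g} - g_{\lambda_n})^2 + P'(g_{\lambda_n} - g^*)^2 \right) \\
&\le 128\log^2(12/\eta)\left( \frac{8}{n^2\lambda_n} + \frac{\left(\|g^*\|_{\mathcal{G}} + \|g^*\|_{\mathcal{G}}^2\right) \mathcal{N}(\lambda_n)}{n} \right) + 2\lambda_n \|g^*\|_{\mathcal{G}}^2,
\end{aligned}
\]

on the event $\mathcal{A}_1 \cap \mathcal{A}_4$.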


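Step 5. Bounding $|(P'_n - P')(g^* - 1)|$ and $|(P_n - P)(g^* - 1)|$

By Proposition 1, we have the following bound

\[
\left|(P'_n - P')(g^* - 1)\right| \le 2\log(12/\eta)\left( \frac{2\|g^* - 1\|_{\infty}}{n} + \sqrt{\frac{P'(g^* - 1)^2}{n}} \right) \tag{53}
\]

with probability $1 - \eta/6$. Similarly we have

\[
\left|(P_n - P)(g^* - 1)\right| \le 2\log(12/\eta)\left( \frac{2\|g^* - 1\|_{\infty}}{n} + \sqrt{\frac{P(g^* - 1)^2}{n}} \right) \tag{54}
\]

with probability $1 - \eta/6$.

We define $\mathcal{A}_5$ and $\mathcal{A}_6$ as the events where the above inequalities hold:

\[
\mathcal{A}_5 := \left\{ \left|(P'_n - P')(g^* - 1)\right| \le 2\log(12/\eta)\left( \frac{2\|g^* - 1\|_{\infty}}{n} + \sqrt{\frac{P'(g^* - 1)^2}{n}} \right) \right\}, \tag{55}
\]
\[
\mathcal{A}_6 := \left\{ \left|(P_n - P)(g^* - 1)\right| \le 2\log(12/\eta)\left( \frac{2\|g^* - 1\|_{\infty}}{n} + \sqrt{\frac{P(g^* - 1)^2}{n}} \right) \right\}. \tag{56}
\]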


Step  6.  Combining  the  bounds  of  Step  1  to  5. 

Finally  we  obtain 

|PE(*,*')  -  PE(P,  P')\ 


<  81og(12/77 Y 


8  ,  m\g+vwhmK) 


n2Xr 


+ 


n 


108(12/,)  |  +  ||9. 


XnAf(Xr 


n 


8  log(12/ 77) 


21  §  ,  (\\g*U  +  \\g*\\gW(K 


n2Xr 


+ 


77 


108(12/,)  I  +  ||9 


77 


*  II  2 
Q 


XnJ\f(Xr 


77 


+  -A  C 

1  ^  /Xrn\J  n^i 


log(12/7/) 


[f 


-  II 


77 


+ 


P'fa*  -  l)2  P(g*  -  l)2 


+ 


77 


77 


+  \^P’(9'  ~  l)2  Jl281og2(12/,)  ( 


8  +»dk±MIMM1+2A„l|9.||,, 


772A 


77 


on the event $\bigcap_{i=1}^{6} \mathcal{A}_i$, the probability of which is at least $1 - \eta$.

Proof  of  Theorem  1’.  By  Proposition  3  in  Caponnetto  and  de  Vito  (2007),  we  obtain 

■V(A)  <  r— — A“2, 

2-7 


□ 
where $c$ is the constant that appears in the assumption. Then, substituting the above inequality and $\lambda_n = \left(\frac{\log n}{n}\right)^{\frac{2}{2+\gamma}}$ into Eq.(14), we can see that there is a constant $K$ depending on $c$, $\gamma$, and $\|g^*\|_{\mathcal{G}}$ such that
\FE{X,X')  —  PE(P,  P')\ 

<  /i  |  (log(12/>j)2  +  log(12/>j)  +  1)  f'f'j 

+ MW’))  1)2  + 

+  vna-  -  i)V(iog(i2/i;)2  +  1)  |,  (57) 

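with probability at least $1 - \eta$ under the condition Eq.(13). The condition Eq.(13) is satisfied for sufficiently large $n$. Therefore Eq.(57) implies that

\[
\left|\widehat{\mathrm{PE}}(X, X') - \mathrm{PE}(P, P')\right| = O_p\!\left( \left(\frac{\log n}{n}\right)^{\frac{2}{2+\gamma}} + \sqrt{P'(g^* - 1)^2}\, \left(\frac{\log n}{n}\right)^{\frac{1}{2+\gamma}} \right).
\]

□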


B  Proof  of  Theorem  2 


In this section, we prove Theorem 2. Here, to be more precise, we restate Theorem 2 as follows.

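Theorem 2'. Let $F_n(\cdot \mid X \cup X')$ be the distribution function of $\widehat{\mathrm{PE}}(X, X')$ given $X \cup X'$. Let

\[
q(X \cup X') := \sup\left\{ x \in \mathbb{R} \mid F_n(x \mid X \cup X') \le 1 - \alpha \right\}
\]

be the upper $100\alpha$-percentile point. Then, if the null hypothesis is true (i.e., $P = P'$),

\[
\mathrm{Prob}\left( \widehat{\mathrm{PE}}(X, X') > q(X \cup X') \right) \le \alpha.
\]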


Proof. Since the samples $\{x_i\}_{i=1}^{n}$ and $\{x'_j\}_{j=1}^{n}$ are distributed i.i.d. and $P = P'$, they are exchangeable, i.e., the distribution of $(y_1, \ldots, y_{2n}) = (x_1, \ldots, x_n, x'_1, \ldots, x'_n)$ is the same as that of $(y_{\tau(1)}, \ldots, y_{\tau(2n)})$ for any permutation $\tau$ on $\{1, \ldots, 2n\}$. This means that the distribution function $F_n(\cdot \mid S)$ is the same as that of $\widehat{\mathrm{PE}}(X, X')$ conditioned on $S = X \cup X'$. Then, we have

\[
\begin{aligned}
\mathrm{Prob}\left( \widehat{\mathrm{PE}}(X, X') > q(X \cup X') \right)
&= \mathbb{E}_{X \cup X'}\left[ \mathrm{Prob}\left( \widehat{\mathrm{PE}}(X, X') > q(X \cup X') \mid X \cup X' \right) \right] \\
&= \mathbb{E}_{X \cup X'}\left[ 1 - F_n\left( q(X \cup X') \mid X \cup X' \right) \right] \le \alpha,
\end{aligned}
\]

which concludes the proof.


□ 
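The exchangeability argument above is what justifies calibrating the test by random permutations of the pooled sample. A minimal sketch of such a procedure follows. It is generic: `statistic` is a hypothetical callable standing in for the divergence estimator, the Monte Carlo p-value with the "+1" correction is a common convention rather than the exact percentile rule of Theorem 2', and the number of permutations and the seed are arbitrary.

```python
import random


def permutation_p_value(x, x_prime, statistic, n_permutations=200, seed=0):
    """Monte Carlo permutation p-value for a two-sample statistic.

    x, x_prime : sequences of samples
    statistic  : callable (sample_a, sample_b) -> float, larger = more different
                 (hypothetical placeholder for a divergence estimator)
    """
    rng = random.Random(seed)
    observed = statistic(x, x_prime)
    pooled = list(x) + list(x_prime)
    n = len(x)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random re-split of the pooled sample
        if statistic(pooled[:n], pooled[n:]) >= observed:
            exceed += 1
    # The "+1" counts the observed split itself among the permutations.
    return (exceed + 1) / (n_permutations + 1)


def mean_gap(a, b):
    """Toy statistic used only for this illustration."""
    return abs(sum(a) / len(a) - sum(b) / len(b))


x = list(range(10))
x_prime = [v + 100 for v in range(10)]
print(permutation_p_value(x, x_prime, mean_gap))
```

Rejecting the null hypothesis when the returned value is at most $\alpha$ plays the role of comparing the statistic with the upper $100\alpha$-percentile point of the permutation distribution.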
Abstract

Estimation  of  the  ratio  of  probability  densities  has  attracted  a  great  deal  of  attention 
since  it  can  be  used  for  addressing  various  statistical  paradigms.  A  naive  approach  to 
density-ratio  approximation  is  to  first  estimate  numerator  and  denominator  densities 
separately  and  then  take  their  ratio.  However,  this  two-step  approach  does  not 
perform  well  in  practice,  and  methods  for  directly  estimating  density  ratios  without 
density  estimation  have  been  explored.  In  this  paper,  we  first  give  a  comprehensive 
review  of  existing  density-ratio  estimation  methods  and  discuss  their  pros  and  cons. 
Then we propose a new framework of density-ratio estimation in which a density-ratio model is fitted to the true density-ratio under the Bregman divergence. Our
new  framework  includes  existing  approaches  as  special  cases,  and  is  substantially 
more  general.  Finally,  we  develop  a  robust  density-ratio  estimation  method  under 
the  power  divergence,  which  is  a  novel  instance  in  our  framework. 

Keywords 

Density ratio, Bregman divergence, Logistic regression, Kernel mean matching, Kullback-Leibler importance estimation procedure, Least-squares importance fitting
1  Introduction

The ratio of probability densities can be used for various statistical data processing purposes (Sugiyama et al., 2009, 2012) such as discriminant analysis (Silverman, 1978), non-stationarity adaptation (Shimodaira, 2000; Sugiyama and Müller, 2005; Sugiyama et al., 2007; Quiñonero-Candela et al., 2009; Sugiyama and Kawanabe, 2011), multi-task learning (Bickel et al., 2008), outlier detection (Hido et al., 2008; Smola et al., 2009; Hido et al., 2011), two-sample test (Keziou and Leoni-Aubin, 2005; Sugiyama et al., 2011a), change detection in time series (Kawahara and Sugiyama, 2009), conditional density estimation (Sugiyama et al., 2010), and probabilistic classification (Sugiyama, 2010).

Furthermore, mutual information, which plays a central role in information theory (Cover and Thomas, 2006), can be estimated via density-ratio estimation (Suzuki et al., 2008, 2009b). Since mutual information is a measure of statistical independence between random variables, density-ratio estimation can also be used for variable selection (Suzuki et al., 2009a), dimensionality reduction (Suzuki and Sugiyama, 2010), independent component analysis (Suzuki and Sugiyama, 2009), causal inference (Yamada and Sugiyama, 2010), clustering (Kimura and Sugiyama, 2011), and cross-domain object matching (Yamada and Sugiyama, 2011). Thus, density-ratio estimation is a versatile tool for statistical data processing.

A naive approach to approximating a density-ratio is to separately estimate the two densities corresponding to the numerator and denominator of the ratio, and then take the ratio of the estimated densities. However, this naive approach is not reliable in high-dimensional problems since division by an estimated quantity can magnify the estimation error of the dividend. To overcome this drawback, various approaches to directly estimating density-ratios without going through density estimation have been explored recently, including the moment matching approach (Gretton et al., 2009), the probabilistic classification approach (Qin, 1998; Cheng and Chu, 2004), the density matching approach (Sugiyama et al., 2008; Tsuboi et al., 2009; Yamada and Sugiyama, 2009; Nguyen et al., 2010; Yamada et al., 2010), and the density-ratio fitting approach (Kanamori et al., 2009).

The purpose of this paper is to provide a general framework of density-ratio estimation that accommodates the above methods. More specifically, we propose a new density-ratio estimation approach called density-ratio matching, in which a density-ratio model is fitted to the true density-ratio function under the Bregman divergence (Bregman, 1967). We further develop a robust density-ratio estimation method under the power divergence (Basu et al., 1998), which is a novel instance in our general framework. Note that the Bregman divergence has been widely used in the machine learning literature so far (Collins et al., 2002; Murata et al., 2004; Tsuda et al., 2005; Dhillon and Sra, 2006; Cayton, 2008; Wu et al., 2009), and the current paper explores a new application of the Bregman divergence in the framework of density-ratio estimation.

The  rest  of  this  paper  is  organized  as  follows.  After  the  problem  formulation  below,  we 
give  a  comprehensive  review  of  density-ratio  estimation  methods  in  Section  2.  In  Section  3, 
we  describe  our  new  framework  for  density-ratio  estimation.  Finally,  we  conclude  in 
Section  4. 
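Problem Formulation: The problem of density-ratio estimation addressed in this paper is formulated as follows. Let $\mathcal{X}$ ($\subset \mathbb{R}^d$) be the data domain, and suppose we are given independent and identically distributed (i.i.d.) samples $\{x_i^{\mathrm{nu}}\}_{i=1}^{n_{\mathrm{nu}}}$ from a distribution with density $p^*_{\mathrm{nu}}(x)$ defined on $\mathcal{X}$ and i.i.d. samples $\{x_j^{\mathrm{de}}\}_{j=1}^{n_{\mathrm{de}}}$ from another distribution with density $p^*_{\mathrm{de}}(x)$ defined on $\mathcal{X}$:

\[
\{x_i^{\mathrm{nu}}\}_{i=1}^{n_{\mathrm{nu}}} \overset{\mathrm{i.i.d.}}{\sim} p^*_{\mathrm{nu}}(x) \quad \text{and} \quad \{x_j^{\mathrm{de}}\}_{j=1}^{n_{\mathrm{de}}} \overset{\mathrm{i.i.d.}}{\sim} p^*_{\mathrm{de}}(x).
\]

We assume that $p^*_{\mathrm{de}}(x)$ is strictly positive over the domain $\mathcal{X}$. The goal is to estimate the density-ratio,

\[
r^*(x) := \frac{p^*_{\mathrm{nu}}(x)}{p^*_{\mathrm{de}}(x)},
\]

from samples $\{x_i^{\mathrm{nu}}\}_{i=1}^{n_{\mathrm{nu}}}$ and $\{x_j^{\mathrm{de}}\}_{j=1}^{n_{\mathrm{de}}}$. 'nu' and 'de' indicate 'numerator' and 'denominator', respectively.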


2  Existing  Density-Ratio  Estimation  Methods 

In this section, we give a comprehensive review of existing density-ratio estimation methods.

2.1  Moment  Matching 

Here,  we  describe  a  framework  of  density-ratio  estimation  based  on  moment  matching. 

2.1.1  Finite-Order  Approach 

First, we describe methods of finite-order moment matching for density-ratio estimation.

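The simplest implementation of moment matching would be to match the first-order moment (i.e., the mean):

\[
\operatorname*{argmin}_{r} \left\| \int x\, r(x)\, p^*_{\mathrm{de}}(x)\, \mathrm{d}x - \int x\, p^*_{\mathrm{nu}}(x)\, \mathrm{d}x \right\|^2,
\]

where $\|\cdot\|$ denotes the Euclidean norm. Its non-linear variant can be obtained using some non-linear function $\phi(x) : \mathbb{R}^d \to \mathbb{R}^t$ as

\[
\operatorname*{argmin}_{r} \mathrm{MM}'(r),
\]

where

\[
\mathrm{MM}'(r) := \left\| \int \phi(x)\, r(x)\, p^*_{\mathrm{de}}(x)\, \mathrm{d}x - \int \phi(x)\, p^*_{\mathrm{nu}}(x)\, \mathrm{d}x \right\|^2.
\]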
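'MM' stands for 'moment matching'. Let us ignore the irrelevant constant in $\mathrm{MM}'(r)$ and define the rest as $\mathrm{MM}(r)$:

\[
\mathrm{MM}(r) := \left\| \int \phi(x)\, r(x)\, p^*_{\mathrm{de}}(x)\, \mathrm{d}x \right\|^2 - 2\left\langle \int \phi(x)\, r(x)\, p^*_{\mathrm{de}}(x)\, \mathrm{d}x,\; \int \phi(x)\, p^*_{\mathrm{nu}}(x)\, \mathrm{d}x \right\rangle, \tag{1}
\]

where $\langle \cdot, \cdot \rangle$ denotes the inner product.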

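In practice, the expectations over $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$ in $\mathrm{MM}(r)$ are replaced by sample averages. That is, for an $n_{\mathrm{de}}$-dimensional vector

\[
r^*_{\mathrm{de}} := \left( r^*(x_1^{\mathrm{de}}), \ldots, r^*(x_{n_{\mathrm{de}}}^{\mathrm{de}}) \right)^\top,
\]

where $\top$ denotes the transpose, an estimator $\hat{r}_{\mathrm{de}}$ of $r^*_{\mathrm{de}}$ can be obtained by solving the following optimization problem.

\[
\hat{r}_{\mathrm{de}} := \operatorname*{argmin}_{r \in \mathbb{R}^{n_{\mathrm{de}}}} \widehat{\mathrm{MM}}(r), \tag{2}
\]

where

\[
\widehat{\mathrm{MM}}(r) := \frac{1}{n_{\mathrm{de}}^2}\, r^\top \Phi_{\mathrm{de}}^\top \Phi_{\mathrm{de}}\, r - \frac{2}{n_{\mathrm{de}} n_{\mathrm{nu}}}\, r^\top \Phi_{\mathrm{de}}^\top \Phi_{\mathrm{nu}} 1_{n_{\mathrm{nu}}}. \tag{3}
\]

$1_n$ denotes the $n$-dimensional vector with all ones. $\Phi_{\mathrm{nu}}$ and $\Phi_{\mathrm{de}}$ are the $t \times n_{\mathrm{nu}}$ and $t \times n_{\mathrm{de}}$ design matrices defined by

\[
\Phi_{\mathrm{nu}} := \left( \phi(x_1^{\mathrm{nu}}), \ldots, \phi(x_{n_{\mathrm{nu}}}^{\mathrm{nu}}) \right) \quad \text{and} \quad \Phi_{\mathrm{de}} := \left( \phi(x_1^{\mathrm{de}}), \ldots, \phi(x_{n_{\mathrm{de}}}^{\mathrm{de}}) \right),
\]

respectively. Taking the derivative of the objective function (3) with respect to $r$ and setting it to zero, we have

\[
\frac{2}{n_{\mathrm{de}}^2}\, \Phi_{\mathrm{de}}^\top \Phi_{\mathrm{de}}\, r - \frac{2}{n_{\mathrm{de}} n_{\mathrm{nu}}}\, \Phi_{\mathrm{de}}^\top \Phi_{\mathrm{nu}} 1_{n_{\mathrm{nu}}} = 0_{n_{\mathrm{de}}},
\]

where $0_{n_{\mathrm{de}}}$ denotes the $n_{\mathrm{de}}$-dimensional vector with all zeros. Solving this equation with respect to $r$, one can obtain the solution analytically as

\[
\hat{r}_{\mathrm{de}} = \frac{n_{\mathrm{de}}}{n_{\mathrm{nu}}} \left( \Phi_{\mathrm{de}}^\top \Phi_{\mathrm{de}} \right)^{-1} \Phi_{\mathrm{de}}^\top \Phi_{\mathrm{nu}} 1_{n_{\mathrm{nu}}}.
\]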
One may add a normalization constraint

\[
\frac{1}{n_{\mathrm{de}}}\, 1_{n_{\mathrm{de}}}^\top r = 1
\]

to the optimization problem (2). Then the optimization problem becomes a convex linearly-constrained quadratic program. Since there is no known method for obtaining the analytic-form solution for convex linearly-constrained quadratic programs, a numerical solver may be needed to compute the solution. Furthermore, a non-negativity constraint

\[
r \ge 0_{n_{\mathrm{de}}}
\]

and/or an upper bound for a positive constant $B$, i.e.,

\[
r \le B\, 1_{n_{\mathrm{de}}}
\]

may also be incorporated in the optimization problem (2), where inequalities for vectors are applied in the element-wise manner. Even with these modifications, the optimization problem is still a convex linearly-constrained quadratic program, so its solution can be numerically computed by standard optimization software.
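The unconstrained fixed-design solution described above can be sketched in a few lines. The function name and the toy feature matrices are hypothetical. Using the pseudo-inverse instead of a plain inverse is an assumption of this sketch: it returns the minimum-norm solution when $\Phi_{\mathrm{de}}^\top \Phi_{\mathrm{de}}$ is singular, which happens whenever $t < n_{\mathrm{de}}$.

```python
import numpy as np


def moment_matching_fixed_design(phi_de, phi_nu):
    """Unconstrained finite-order moment matching at the denominator points.

    phi_de : (t, n_de) array, columns are phi(x_j^de)
    phi_nu : (t, n_nu) array, columns are phi(x_i^nu)
    Returns the n_de-dimensional vector of density-ratio estimates.
    """
    n_de = phi_de.shape[1]
    n_nu = phi_nu.shape[1]
    gram = phi_de.T @ phi_de                    # n_de x n_de
    rhs = phi_de.T @ (phi_nu @ np.ones(n_nu))   # n_de
    # pinv is an assumption: gram is singular when t < n_de.
    return (n_de / n_nu) * (np.linalg.pinv(gram) @ rhs)


# Toy example with hypothetical features (t = n_de = 3, n_nu = 2).
phi_de = np.eye(3)
phi_nu = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
r_hat = moment_matching_fixed_design(phi_de, phi_nu)
print(r_hat)
# The weighted denominator moments then match the numerator moments.
print(phi_de @ r_hat / 3, phi_nu.mean(axis=1))
```

The constrained variants (normalization, non-negativity, upper bound) would require a quadratic-programming solver and are not covered by this sketch.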

The above fixed-design method gives estimates of the density-ratio values only at the denominator sample points $\{x_j^{\mathrm{de}}\}_{j=1}^{n_{\mathrm{de}}}$. Below, we consider the induction setup, where the entire density-ratio function $r^*(x)$ is estimated (Qin, 1998; Kanamori et al., 2012).

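We use the following linear density-ratio model for density-ratio function learning:

\[
r(x) := \sum_{\ell=1}^{b} \theta_\ell\, \psi_\ell(x) = \psi(x)^\top \theta, \tag{4}
\]

where $\psi(x) : \mathbb{R}^d \to \mathbb{R}^b$ is a basis function vector and $\theta$ ($\in \mathbb{R}^b$) is a parameter vector. We assume that the basis functions are non-negative:

\[
\psi(x) \ge 0_b.
\]

Then model outputs at $\{x_j^{\mathrm{de}}\}_{j=1}^{n_{\mathrm{de}}}$ are expressed in terms of the parameter vector $\theta$ as

\[
\left( r(x_1^{\mathrm{de}}), \ldots, r(x_{n_{\mathrm{de}}}^{\mathrm{de}}) \right)^\top = \Psi_{\mathrm{de}}^\top \theta,
\]

where $\Psi_{\mathrm{de}}$ is the $b \times n_{\mathrm{de}}$ design matrix defined by

\[
\Psi_{\mathrm{de}} := \left( \psi(x_1^{\mathrm{de}}), \ldots, \psi(x_{n_{\mathrm{de}}}^{\mathrm{de}}) \right). \tag{5}
\]

Then, following Eq.(2), the parameter $\theta$ is learned as follows.

\[
\hat{\theta} := \operatorname*{argmin}_{\theta \in \mathbb{R}^b} \left[ \frac{1}{n_{\mathrm{de}}^2}\, \theta^\top \Psi_{\mathrm{de}} \Phi_{\mathrm{de}}^\top \Phi_{\mathrm{de}} \Psi_{\mathrm{de}}^\top \theta - \frac{2}{n_{\mathrm{de}} n_{\mathrm{nu}}}\, \theta^\top \Psi_{\mathrm{de}} \Phi_{\mathrm{de}}^\top \Phi_{\mathrm{nu}} 1_{n_{\mathrm{nu}}} \right]. \tag{6}
\]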


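Taking the derivative of the above objective function with respect to $\theta$ and setting it to zero, we have the solution $\hat{\theta}$ analytically as

\[
\hat{\theta} = \frac{n_{\mathrm{de}}}{n_{\mathrm{nu}}} \left( \Psi_{\mathrm{de}} \Phi_{\mathrm{de}}^\top \Phi_{\mathrm{de}} \Psi_{\mathrm{de}}^\top \right)^{-1} \Psi_{\mathrm{de}} \Phi_{\mathrm{de}}^\top \Phi_{\mathrm{nu}} 1_{n_{\mathrm{nu}}}.
\]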
One may include a normalization constraint, a non-negativity constraint (given that the basis functions are non-negative), and a regularization constraint to the optimization problem (6):

\[
\frac{1}{n_{\mathrm{de}}}\, 1_{n_{\mathrm{de}}}^\top \Psi_{\mathrm{de}}^\top \theta = 1, \quad \theta \ge 0_b, \quad \text{and} \quad \theta \le B\, 1_b.
\]

Then the optimization problem becomes a convex linearly-constrained quadratic program, whose solution can be obtained by a standard numerical solver.

The upper-bound parameter $B$, which works as a regularizer, may be optimized by cross-validation (CV) with respect to the moment-matching error MM defined by Eq.(1). Availability of CV would be one of the advantages of the inductive method (i.e., learning the entire density-ratio function).

2.1.2  Infinite-Order  Approach:  KMM 

Matching  a  finite  number  of  moments  does  not  necessarily  lead  to  the  true  density-ratio 
function  r*(x),  even  if  infinitely  many  samples  are  available.  In  order  to  guarantee  that  the 
true  density-ratio  function  can  always  be  obtained  in  the  large-sample  limit,  all  moments 
up  to  the  infinite  order  need  to  be  matched.  Here  we  describe  a  method  of  infinite-oder 
moment-matching  called  kernel  mean  matching  (KMM),  which  allows  one  to  efficiently 
match  all  the  moments  using  kernel  functions  (Huang  et  ah,  2007;  Gretton  et  ah,  2009). 

The basic idea of KMM is essentially the same as the finite-order approach, but a universal reproducing kernel $K(x, x')$ (Steinwart, 2001) is used as a non-linear transformation. The Gaussian kernel

$$K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right) \tag{7}$$

is an example of universal reproducing kernels. It has been shown that the solution of the following optimization problem agrees with the true density-ratio (Huang et al., 2007; Gretton et al., 2009):


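$$\min_{r \in \mathcal{H}} \left\| \int K(x, \cdot)\, p^*_{\mathrm{nu}}(x)\, \mathrm{d}x - \int K(x, \cdot)\, r(x)\, p^*_{\mathrm{de}}(x)\, \mathrm{d}x \right\|_{\mathcal{H}}^2,$$

where $\mathcal{H}$ denotes a universal reproducing kernel Hilbert space and $\|\cdot\|_{\mathcal{H}}$ denotes its norm.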
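An empirical version of the above problem is expressed as

$$\min_{r \in \mathbb{R}^{n_{\mathrm{de}}}} \left[ \frac{1}{n_{\mathrm{de}}^2} r^\top K_{\mathrm{de,de}}\, r - \frac{2}{n_{\mathrm{de}} n_{\mathrm{nu}}} r^\top K_{\mathrm{de,nu}} \mathbf{1}_{n_{\mathrm{nu}}} \right],$$

where $K_{\mathrm{de,de}}$ and $K_{\mathrm{de,nu}}$ denote the kernel Gram matrices defined by

$$[K_{\mathrm{de,de}}]_{j,j'} = K(x^{\mathrm{de}}_j, x^{\mathrm{de}}_{j'}) \quad \text{and} \quad [K_{\mathrm{de,nu}}]_{j,i} = K(x^{\mathrm{de}}_j, x^{\mathrm{nu}}_i), \tag{8}$$

respectively. In the same way as the finite-order case, the solution can be obtained analytically as

$$\widehat{r}_{\mathrm{de}} = \frac{n_{\mathrm{de}}}{n_{\mathrm{nu}}} K_{\mathrm{de,de}}^{-1} K_{\mathrm{de,nu}} \mathbf{1}_{n_{\mathrm{nu}}}. \tag{9}$$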

If necessary, one may include a non-negativity constraint, a normalization constraint, and an upper bound in the same way as the finite-order case. Then the solution can be numerically obtained by solving a convex linearly-constrained quadratic programming problem.
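The analytic solution (9) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy Gaussian data, the kernel width, and the small `ridge` term added to the Gram matrix for numerical stability are all choices of this sketch and do not appear in the text.

```python
import numpy as np


def gaussian_kernel(a, b, sigma):
    """Gram matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2)), cf. Eq.(7)."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))


def kmm_weights(x_de, x_nu, sigma, ridge=1e-3):
    """Unconstrained KMM weights at the denominator samples, cf. Eq.(9).

    `ridge` is an assumption of this sketch: it keeps the linear solve
    well-conditioned and is not part of Eq.(9) itself.
    """
    n_de, n_nu = len(x_de), len(x_nu)
    k_dede = gaussian_kernel(x_de, x_de, sigma)
    k_denu = gaussian_kernel(x_de, x_nu, sigma)
    rhs = (n_de / n_nu) * (k_denu @ np.ones(n_nu))
    return np.linalg.solve(k_dede + ridge * np.eye(n_de), rhs)


# Toy data (hypothetical): denominator N(0, 1), numerator N(0.5, 1).
rng = np.random.default_rng(0)
x_de = rng.normal(0.0, 1.0, size=(200, 1))
x_nu = rng.normal(0.5, 1.0, size=(150, 1))
r_de = kmm_weights(x_de, x_nu, sigma=1.0)
```

Because no constraints are imposed here, individual entries of `r_de` may be negative; the constrained variant described above would require a quadratic-programming solver instead of a linear solve.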

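For a linear density-ratio model (4), an inductive variant of KMM is formulated as

$$\min_{\theta \in \mathbb{R}^b} \left[ \frac{1}{n_{\mathrm{de}}^2} \theta^\top \Psi_{\mathrm{de}} K_{\mathrm{de,de}} \Psi_{\mathrm{de}}^\top \theta - \frac{2}{n_{\mathrm{de}} n_{\mathrm{nu}}} \theta^\top \Psi_{\mathrm{de}} K_{\mathrm{de,nu}} \mathbf{1}_{n_{\mathrm{nu}}} \right],$$

and the solution $\widehat{\theta}$ is given by

$$\widehat{\theta} = \frac{n_{\mathrm{de}}}{n_{\mathrm{nu}}} \left( \Psi_{\mathrm{de}} K_{\mathrm{de,de}} \Psi_{\mathrm{de}}^\top \right)^{-1} \Psi_{\mathrm{de}} K_{\mathrm{de,nu}} \mathbf{1}_{n_{\mathrm{nu}}}.$$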


2.1.3  Remarks 

The infinite-order moment matching method, kernel mean matching (KMM), can efficiently match all the moments by making use of universal reproducing kernels. Indeed, KMM has the excellent theoretical property that it is consistent (Huang et al., 2007; Gretton et al., 2009). However, KMM has a limitation in model selection: there is no known method for determining the kernel parameter (i.e., the Gaussian kernel width). A popular heuristic of setting the Gaussian width to the median distance between samples (Schölkopf and Smola, 2002) would be useful in some cases, but this may not always be reasonable.
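The median heuristic mentioned above is easy to state in code. The sketch below assumes that "distance between samples" means the Euclidean distance over all distinct pairs of rows of one array (for instance, the pooled numerator and denominator samples); the text does not specify which samples are pooled, so that choice is left to the caller. The function name is hypothetical.

```python
import numpy as np


def median_distance(x):
    """Median of the pairwise Euclidean distances between distinct rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt(np.sum(diff**2, axis=-1))
    upper = np.triu_indices(len(x), k=1)  # count each unordered pair once
    return float(np.median(dist[upper]))


# Three one-dimensional points: the pairwise distances are 1, 2 and 3.
x = np.array([[0.0], [1.0], [3.0]])
sigma = median_distance(x)  # the median of {1, 2, 3}
```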

In the above, moment matching was performed in terms of the squared norm, which led to an analytic-form solution (if no constraint is imposed). As shown in Kanamori et al. (2012), moment matching can be systematically generalized to various divergences.

2.2  Probabilistic  Classification 

Here, we describe a framework of density-ratio estimation through probabilistic classification.

2.2.1  Basic  Framework 

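The basic idea of the probabilistic classification approach is to obtain a probabilistic classifier that separates numerator samples $\{x^{\mathrm{nu}}_i\}_{i=1}^{n_{\mathrm{nu}}}$ and denominator samples $\{x^{\mathrm{de}}_j\}_{j=1}^{n_{\mathrm{de}}}$.

Let us assign a label $y = +1$ to $\{x^{\mathrm{nu}}_i\}_{i=1}^{n_{\mathrm{nu}}}$ and $y = -1$ to $\{x^{\mathrm{de}}_j\}_{j=1}^{n_{\mathrm{de}}}$, respectively. Then the two densities $p^*_{\mathrm{nu}}(x)$ and $p^*_{\mathrm{de}}(x)$ are written as

$$p^*_{\mathrm{nu}}(x) = p^*(x \mid y = +1) \quad \text{and} \quad p^*_{\mathrm{de}}(x) = p^*(x \mid y = -1),$$

respectively. Note that $y$ is regarded as a random variable here. An application of Bayes' theorem,

$$p^*(x \mid y) = \frac{p^*(y \mid x)\, p^*(x)}{p^*(y)},$$

yields that the density-ratio $r^*(x)$ can be expressed in terms of $y$ as follows:

$$r^*(x) = \frac{p^*_{\mathrm{nu}}(x)}{p^*_{\mathrm{de}}(x)} = \left( \frac{p^*(y = +1 \mid x)\, p^*(x)}{p^*(y = +1)} \right) \left( \frac{p^*(y = -1 \mid x)\, p^*(x)}{p^*(y = -1)} \right)^{-1} = \frac{p^*(y = -1)}{p^*(y = +1)} \cdot \frac{p^*(y = +1 \mid x)}{p^*(y = -1 \mid x)}.$$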


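The ratio $p^*(y = -1)/p^*(y = +1)$ may be simply approximated by the ratio of the sample sizes:

$$\frac{p^*(y = -1)}{p^*(y = +1)} \approx \frac{n_{\mathrm{de}}/(n_{\mathrm{de}} + n_{\mathrm{nu}})}{n_{\mathrm{nu}}/(n_{\mathrm{de}} + n_{\mathrm{nu}})} = \frac{n_{\mathrm{de}}}{n_{\mathrm{nu}}}.$$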


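The 'class'-posterior probability $p^*(y \mid x)$ may be approximated by separating $\{x^{\mathrm{nu}}_i\}_{i=1}^{n_{\mathrm{nu}}}$ and $\{x^{\mathrm{de}}_j\}_{j=1}^{n_{\mathrm{de}}}$ using a probabilistic classifier. Thus, given an estimator of the class-posterior probability, $\widehat{p}(y \mid x)$, a density-ratio estimator $\widehat{r}(x)$ can be constructed as

$$\widehat{r}(x) = \frac{n_{\mathrm{de}}}{n_{\mathrm{nu}}} \cdot \frac{\widehat{p}(y = +1 \mid x)}{\widehat{p}(y = -1 \mid x)}. \tag{10}$$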


A practical advantage of the probabilistic classification approach would be its easy implementability. Indeed, one can directly use standard probabilistic classification algorithms for density-ratio estimation. Another, more important advantage of the probabilistic classification approach is that model selection (i.e., tuning the basis functions and the regularization parameter) is possible by standard cross-validation since the estimation problem involved in this framework is a standard supervised classification problem.

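Below, two probabilistic classification algorithms are described. To keep the explanation simple, we consider a set of paired samples $\{(x_k, y_k)\}_{k=1}^{n}$, where, for $n = n_{\mathrm{nu}} + n_{\mathrm{de}}$,

$$(x_1, \ldots, x_n) := (x^{\mathrm{nu}}_1, \ldots, x^{\mathrm{nu}}_{n_{\mathrm{nu}}}, x^{\mathrm{de}}_1, \ldots, x^{\mathrm{de}}_{n_{\mathrm{de}}}),$$

$$(y_1, \ldots, y_n) := (\underbrace{+1, \ldots, +1}_{n_{\mathrm{nu}}}, \underbrace{-1, \ldots, -1}_{n_{\mathrm{de}}}).$$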


2.2.2  Logistic  Regression 

Here, a popular probabilistic classification algorithm called logistic regression (Hastie et al., 2001) is explained.

A logistic regression classifier employs a parametric model of the following form for expressing the class-posterior probability $p^*(y \mid x)$:

$$p(y \mid x; \theta) = \frac{1}{1 + \exp\left( -y\, \psi(x)^\top \theta \right)},$$

where $\psi(x) : \mathbb{R}^d \to \mathbb{R}^b$ is a basis function vector and $\theta$ ($\in \mathbb{R}^b$) is a parameter vector. The parameter vector $\theta$ is determined so that the penalized log-likelihood is maximized, which can be expressed as the following minimization problem:

$$\widehat{\theta} := \mathop{\mathrm{argmin}}_{\theta \in \mathbb{R}^b} \left[ \sum_{k=1}^{n} \log\left( 1 + \exp\left( -y_k \psi(x_k)^\top \theta \right) \right) + \lambda\, \theta^\top \theta \right], \tag{11}$$

where $\lambda\, \theta^\top \theta$ is a penalty term included for regularization purposes.

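Since the objective function in Eq.(11) is convex, the global optimal solution can be obtained by a standard non-linear optimization technique such as the gradient descent method or (quasi-)Newton methods (Hastie et al., 2001; Minka, 2007). Finally, a density-ratio estimator $\widehat{r}_{\mathrm{LR}}(x)$ is given by

$$\widehat{r}_{\mathrm{LR}}(x) = \frac{n_{\mathrm{de}}}{n_{\mathrm{nu}}} \cdot \frac{1 + \exp\left( \psi(x)^\top \widehat{\theta} \right)}{1 + \exp\left( -\psi(x)^\top \widehat{\theta} \right)} = \frac{n_{\mathrm{de}}}{n_{\mathrm{nu}}} \exp\left( \psi(x)^\top \widehat{\theta} \right),$$

where 'LR' stands for 'logistic regression'.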
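As an illustration of Eq.(10) and Eq.(11), the following sketch fits the penalized logistic regression model by plain gradient descent and converts the fitted parameter into a density-ratio estimate. It is a sketch under stated assumptions rather than a reference implementation: the basis $\psi(x) = (1, x)^\top$, the step size, the iteration count, the value of $\lambda$, and the toy Gaussian data are all choices made here, and the function names are hypothetical.

```python
import numpy as np


def fit_logistic(psi, y, lam=1e-3, lr=0.5, n_iter=2000):
    """Gradient descent on sum_k log(1 + exp(-y_k psi_k^T theta)) + lam * theta^T theta."""
    n, b = psi.shape
    theta = np.zeros(b)
    for _ in range(n_iter):
        margin = y * (psi @ theta)
        # 1 / (1 + exp(margin)), written with tanh for numerical stability
        weight = 0.5 * (1.0 - np.tanh(0.5 * margin))
        grad = -(psi * (y * weight)[:, None]).sum(axis=0) + 2.0 * lam * theta
        theta -= lr * grad / n  # step scaled by the sample size
    return theta


# Toy data (hypothetical): numerator N(1, 1), denominator N(0, 1),
# so that the true log-ratio is x - 0.5 and the model is correctly specified.
rng = np.random.default_rng(0)
x_nu = rng.normal(1.0, 1.0, size=1000)
x_de = rng.normal(0.0, 1.0, size=2000)
x = np.concatenate([x_nu, x_de])
y = np.concatenate([np.ones(len(x_nu)), -np.ones(len(x_de))])
psi = np.column_stack([np.ones_like(x), x])  # includes the constant basis function
theta = fit_logistic(psi, y)


def ratio_lr(x_new):
    """Density-ratio estimate (n_de / n_nu) * exp(psi(x)^T theta)."""
    psi_new = np.column_stack([np.ones_like(x_new), x_new])
    return (len(x_de) / len(x_nu)) * np.exp(psi_new @ theta)


r_at_half = ratio_lr(np.array([0.5]))[0]  # the true ratio at x = 0.5 is 1
```

The sample-size factor `len(x_de) / len(x_nu)` compensates for the class imbalance absorbed by the intercept, which is why it appears explicitly in the estimator.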

Suppose that the logistic regression model $p(y \mid x; \theta)$ satisfies the following two conditions:

•  The constant function is included in the basis functions, i.e., there exists $\theta^\circ$ such that $\psi(x)^\top \theta^\circ = 1$.

•  The model is correctly specified, i.e., there exists $\theta^*$ such that $p(y \mid x; \theta^*) = p^*(y \mid x)$.

Then it was proved that the logistic regression approach is optimal among a class of semi-parametric estimators in the sense that the asymptotic variance is minimized (Qin, 1998). However, when the model is misspecified (which would be the case in practice), the density matching approach explained in Section 2.3 would be preferable (Kanamori et al., 2010).

When multi-class logistic regression classifiers are used, density-ratios among multiple densities can be estimated simultaneously (Bickel et al., 2008). This is useful, e.g., for solving multi-task learning problems (Caruana et al., 1997).

2.2.3  Least-Squares  Probabilistic  Classifier 

Although the performance of these general-purpose non-linear optimization techniques has been improved together with the evolution of computer environment in the last decade, training logistic regression classifiers is still computationally expensive. Here, an alternative probabilistic classification algorithm called least-squares probabilistic classifier (LSPC; Sugiyama, 2010) is described. LSPC is computationally more efficient than logistic regression, with comparable accuracy in practice.

In LSPC, the class-posterior probability $p^*(y \mid x)$ is modeled as

$$p(y \mid x; \theta) := \sum_{\ell=1}^{b} \theta_\ell \psi_\ell(x, y) = \psi(x, y)^\top \theta,$$

where $\psi(x, y)$ ($\in \mathbb{R}^b$) is a non-negative basis function vector, and $\theta$ ($\in \mathbb{R}^b$) is a parameter vector. The class label $y$ takes a value in $\{1, \ldots, c\}$, where $c$ is the number of classes.

The  basic  idea  of  LSPC  is  to  express  the  class-posterior  probability  p*(y\x)  in  terms  of 
the  equivalent  density-ratio  expression:  p*(x,y)/p*(x).  Then  the  density-ratio  estimation 
method  called  unconstrained  least-squares  importance  fitting  (uLSIF;  Kanamori  et  al., 
2009)  is  used  for  estimating  this  density-ratio.  Since  uLSIF  will  be  reviewed  in  detail  in 
Section  2.4.3,  we  only  describe  the  final  solution  here. 

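Let

$$\widehat{H} := \frac{1}{n} \sum_{k=1}^{n} \sum_{y=1}^{c} \psi(x_k, y)\, \psi(x_k, y)^\top \quad \text{and} \quad \widehat{h} := \frac{1}{n} \sum_{k=1}^{n} \psi(x_k, y_k).$$

Then the uLSIF solution is given analytically as $\widehat{\theta} = (\widehat{H} + \lambda I_b)^{-1} \widehat{h}$, where $\lambda$ ($\ge 0$) is the regularization parameter and $I_b$ is the $b$-dimensional identity matrix. In order to assure that the output of LSPC is a probability, the outputs are normalized and negative outputs are rounded up to zero (Yamada et al., 2011):

$$\widehat{p}(y \mid x) = \frac{\max\left( 0, \psi(x, y)^\top \widehat{\theta} \right)}{\sum_{y'=1}^{c} \max\left( 0, \psi(x, y')^\top \widehat{\theta} \right)}.$$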

A standard choice of basis functions $\psi(x, y)$ would be a kernel model:

$$p(y \mid x; \theta) = \sum_{\ell=1}^{n} \theta^{(y)}_\ell K(x, x_\ell), \tag{12}$$

where $K(x, x')$ is some kernel function such as the Gaussian kernel (7). Then the matrix $\widehat{H}$ becomes block-diagonal. Thus, we only need to train a model with $n$ parameters separately $c$ times for each class $y = 1, \ldots, c$. Since all the diagonal block matrices are the same, the computational complexity for computing the solution is $\mathcal{O}(n^3 + cn^2)$.
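For the kernel model (12), the block structure described above means that one regularized linear system with $c$ right-hand sides yields all class-wise parameter vectors. The sketch below illustrates this; it is not the MATLAB implementation referred to later. Labels are coded as $0, \ldots, c-1$ instead of $1, \ldots, c$, the toy data and parameter values are hypothetical, and the uniform fallback when every clipped output is zero is an assumption of the sketch.

```python
import numpy as np


def gaussian_kernel(a, b, sigma):
    """Gram matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))


def lspc_fit(x, y, n_classes, sigma, lam):
    """Solve (H + lam I) theta^(y) = h^(y) for all classes at once."""
    n = len(x)
    k = gaussian_kernel(x, x, sigma)
    h_mat = k.T @ k / n                 # common diagonal block of H-hat
    onehot = np.eye(n_classes)[y]       # shape (n, c), labels in {0, ..., c-1}
    h_vec = k.T @ onehot / n            # column y holds h-hat for class y
    return np.linalg.solve(h_mat + lam * np.eye(n), h_vec)  # shape (n, c)


def lspc_predict_proba(x_new, x_train, theta, sigma):
    """Clip negative outputs to zero and normalize over the classes."""
    scores = np.maximum(0.0, gaussian_kernel(x_new, x_train, sigma) @ theta)
    total = scores.sum(axis=1, keepdims=True)
    safe_total = np.where(total > 0, total, 1.0)
    # Uniform fallback if all outputs were clipped (assumption of this sketch).
    return np.where(total > 0, scores / safe_total, 1.0 / scores.shape[1])


# Toy data (hypothetical): two well-separated clusters in the plane.
rng = np.random.default_rng(0)
x0 = rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(40, 2))
x1 = rng.normal(loc=[2.0, 0.0], scale=0.5, size=(40, 2))
x = np.vstack([x0, x1])
y = np.array([0] * 40 + [1] * 40)

theta = lspc_fit(x, y, n_classes=2, sigma=1.0, lam=0.1)
proba = lspc_predict_proba(np.array([[-2.0, 0.0], [2.0, 0.0]]), x, theta, sigma=1.0)
```

The single call to `np.linalg.solve` factorizes the $n \times n$ matrix once and reuses it for every class, which is where the $\mathcal{O}(n^3 + cn^2)$ cost comes from.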

Let us further reduce the number of kernels in model (12). To this end, we focus on a kernel function $K(x, x')$ that is "localized". Examples of such localized kernels include the popular Gaussian kernel. The idea is to reduce the number of kernels by locating the kernels only at samples belonging to the target class:

$$p(y \mid x; \theta) = \sum_{\ell=1}^{n_y} \theta^{(y)}_\ell K(x, x^{(y)}_\ell), \tag{13}$$

where $n_y$ is the number of training samples in class $y$ and $\{x^{(y)}_\ell\}_{\ell=1}^{n_y}$ is the set of training input samples in class $y$. The rationale behind this model simplification is as follows. By
definition,  the  class-posterior  probability  p*(y\x)  takes  large  values  in  the  regions  where 
samples  in  class  y  are  dense;  conversely,  p*(y\x)  takes  smaller  values  (i.e.,  close  to  zero) 
in  the  regions  where  samples  in  class  y  are  sparse.  When  a  non-negative  function  is 
approximated  by  a  localized  kernel  model,  many  kernels  may  be  needed  in  the  region 
where  the  output  of  the  target  function  is  large;  on  the  other  hand,  only  a  small  number 
of  kernels  would  be  enough  in  the  region  where  the  output  of  the  target  function  is  close 
to  zero.  Following  this  heuristic,  many  kernels  are  allocated  in  the  region  where  p*(y\x) 
takes  large  values,  which  can  be  achieved  by  Eq.(13). 

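This model simplification allows one to further reduce the computational cost since the size of the target blocks in matrix $\widehat{H}$ is further reduced. In order to determine the $n_y$-dimensional parameter vector $\theta^{(y)} = (\theta^{(y)}_1, \ldots, \theta^{(y)}_{n_y})^\top$ for each class $y$, we only need to solve the following system of $n_y$ linear equations:

$$\left( \widehat{H}^{(y)} + \lambda I_{n_y} \right) \theta^{(y)} = \widehat{h}^{(y)}, \tag{14}$$

where $\widehat{H}^{(y)}$ is the $n_y \times n_y$ matrix, and $\widehat{h}^{(y)}$ is the $n_y$-dimensional vector defined as

$$\widehat{H}^{(y)}_{\ell,\ell'} := \frac{1}{n} \sum_{k=1}^{n_y} K(x^{(y)}_k, x^{(y)}_\ell)\, K(x^{(y)}_k, x^{(y)}_{\ell'}) \quad \text{and} \quad \widehat{h}^{(y)}_\ell := \frac{1}{n} \sum_{k=1}^{n_y} K(x^{(y)}_k, x^{(y)}_\ell).$$

Let $\widehat{\theta}^{(y)}$ be the solution of Eq.(14). Then the final solution is given by

$$\widehat{p}(y \mid x) = \frac{\max\left( 0, \sum_{\ell=1}^{n_y} \widehat{\theta}^{(y)}_\ell K(x, x^{(y)}_\ell) \right)}{\sum_{y'=1}^{c} \max\left( 0, \sum_{\ell=1}^{n_{y'}} \widehat{\theta}^{(y')}_\ell K(x, x^{(y')}_\ell) \right)}. \tag{15}$$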


For the simplified model (13), the computational complexity for computing the solution is $\mathcal{O}(c n_y^3)$; when $n_y = n/c$ for all $y$, this is equal to $\mathcal{O}(c^{-2} n^3)$. Thus, this approach is computationally highly efficient for multi-class problems with large $c$.

A  MATLAB®  implementation  of  LSPC  is  available  from 

http://sugiyama-www.cs.titech.ac.jp/~sugi/software/LSPC/ 


2.2.4  Remarks 

Density-ratio estimation by probabilistic classification can successfully avoid density estimation by casting the problem of density-ratio estimation as the problem of estimating the 'class'-posterior probability. An advantage of the probabilistic classification approach over the moment matching approach explained in Section 2.1 is that cross-validation can be used for model selection. Furthermore, existing software packages of probabilistic classification algorithms can be directly used for density-ratio estimation.

The probabilistic classification approach with logistic regression was shown to have a suitable theoretical property (Qin, 1998): if the logistic regression model is correctly specified, the probabilistic classification approach is optimal among a broad class of semi-parametric estimators. However, this strong theoretical property does not hold when the model is misspecified.

An  advantage  of  the  probabilistic  classification  approach  is  that  it  can  be  used  for 
estimating  density-ratios  among  multiple  densities  by  multi-class  probabilistic  classifiers. 
In  this  context,  the  least-squares  probabilistic  classifier  (LSPC)  would  be  practically  useful 
due  to  its  computational  efficiency. 

2.3  Density  Matching 

Here,  we  describe  a  framework  of  density-ratio  estimation  by  density  matching  under  the 
KL  divergence. 

2.3.1  Basic  Framework 

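Let $r(x)$ be a model of the true density-ratio $r^*(x) = p^*_{\mathrm{nu}}(x) / p^*_{\mathrm{de}}(x)$. Then the numerator density $p^*_{\mathrm{nu}}(x)$ may be modeled by $p_{\mathrm{nu}}(x) = r(x)\, p^*_{\mathrm{de}}(x)$. Now let us consider the KL divergence from $p^*_{\mathrm{nu}}(x)$ to $p_{\mathrm{nu}}(x)$:

$$\mathrm{KL}'(p^*_{\mathrm{nu}} \| p_{\mathrm{nu}}) := \int p^*_{\mathrm{nu}}(x) \log \frac{p^*_{\mathrm{nu}}(x)}{p_{\mathrm{nu}}(x)}\, \mathrm{d}x = C - \mathrm{KL}(r),$$

where $C := \int p^*_{\mathrm{nu}}(x) \log \frac{p^*_{\mathrm{nu}}(x)}{p^*_{\mathrm{de}}(x)}\, \mathrm{d}x$ is a constant irrelevant to $r$, and $\mathrm{KL}(r)$ is the relevant part:

$$\mathrm{KL}(r) := \int p^*_{\mathrm{nu}}(x) \log r(x)\, \mathrm{d}x \approx \frac{1}{n_{\mathrm{nu}}} \sum_{i=1}^{n_{\mathrm{nu}}} \log r(x^{\mathrm{nu}}_i).$$

Since $p_{\mathrm{nu}}(x)$ is a probability density function, its integral should be one:

$$1 = \int p_{\mathrm{nu}}(x)\, \mathrm{d}x = \int r(x)\, p^*_{\mathrm{de}}(x)\, \mathrm{d}x \approx \frac{1}{n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} r(x^{\mathrm{de}}_j).$$

Furthermore, the density $p_{\mathrm{nu}}(x)$ should be non-negative, which can be achieved by $r(x) \ge 0$ for all $x$. Combining these equations together, we have the following optimization problem:

$$\max_{r} \; \frac{1}{n_{\mathrm{nu}}} \sum_{i=1}^{n_{\mathrm{nu}}} \log r(x^{\mathrm{nu}}_i) \quad \text{s.t.} \quad \frac{1}{n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} r(x^{\mathrm{de}}_j) = 1 \;\; \text{and} \;\; r(x) \ge 0 \text{ for all } x.$$

This formulation is called the KL importance estimation procedure (KLIEP; Sugiyama et al., 2008).

Possible hyper-parameters in KLIEP (such as basis parameters and regularization parameters) can be optimized using cross-validation with respect to the KL divergence, where only the numerator samples $\{x^{\mathrm{nu}}_i\}_{i=1}^{n_{\mathrm{nu}}}$ appearing in the objective function are cross-validated (Sugiyama et al., 2008).

Below, practical implementations of KLIEP for various density-ratio models are described.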

2.3.2  Linear  and  Kernel  Models 

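Let us employ a linear model for density-ratio estimation:

$$r(x) = \sum_{\ell=1}^{b} \theta_\ell \psi_\ell(x) = \psi(x)^\top \theta, \tag{16}$$

where $\psi(x) : \mathbb{R}^d \to \mathbb{R}^b$ is a non-negative basis function vector, and $\theta$ ($\in \mathbb{R}^b$) is a parameter vector. Then the KLIEP optimization problem for the linear model is expressed as follows (Sugiyama et al., 2008):

$$\max_{\theta \in \mathbb{R}^b} \; \frac{1}{n_{\mathrm{nu}}} \sum_{i=1}^{n_{\mathrm{nu}}} \log\left( \psi(x^{\mathrm{nu}}_i)^\top \theta \right) \quad \text{s.t.} \quad \bar{\psi}_{\mathrm{de}}^\top \theta = 1 \;\; \text{and} \;\; \theta \ge \mathbf{0}_b,$$

where $\bar{\psi}_{\mathrm{de}} := \frac{1}{n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} \psi(x^{\mathrm{de}}_j)$.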

Since the above optimization problem is convex, there exists a unique global optimal solution. Furthermore, the KLIEP solution tends to be sparse, i.e., many parameters take exactly zero, because of the non-negativity constraint. Such sparsity would contribute to reducing the computation time when computing estimated density-ratio values. As can be confirmed from the above optimization problem, the denominator samples $\{x^{\mathrm{de}}_j\}_{j=1}^{n_{\mathrm{de}}}$ appear only in terms of the basis-transformed mean $\bar{\psi}_{\mathrm{de}}$. Thus, KLIEP for linear models is computationally efficient even when the number $n_{\mathrm{de}}$ of denominator samples is very large.

The performance of KLIEP depends on the choice of the basis functions $\psi(x)$. As explained below, the use of the following Gaussian kernel model would be reasonable:

$$r(x) = \sum_{\ell=1}^{n_{\mathrm{nu}}} \theta_\ell K(x, x^{\mathrm{nu}}_\ell), \tag{17}$$

where $K(x, x')$ is the Gaussian kernel (7). The reason why the numerator samples $\{x^{\mathrm{nu}}_i\}_{i=1}^{n_{\mathrm{nu}}}$, not the denominator samples $\{x^{\mathrm{de}}_j\}_{j=1}^{n_{\mathrm{de}}}$, are chosen as the Gaussian centers is as follows. By definition, the density-ratio $r^*(x)$ tends to take large values if $p^*_{\mathrm{de}}(x)$ is small and $p^*_{\mathrm{nu}}(x)$ is large. Conversely, $r^*(x)$ tends to be small (i.e., close to zero) if $p^*_{\mathrm{de}}(x)$ is large and $p^*_{\mathrm{nu}}(x)$ is small. When a non-negative function is approximated by a Gaussian kernel model, many kernels may be needed in the region where the output of the target function is large. On the other hand, only a small number of kernels would be enough in the region where the output of the target function is close to zero. Following this heuristic, many kernels are allocated in the region where $p^*_{\mathrm{nu}}(x)$ takes large values, which can be achieved by setting the Gaussian centers at $\{x^{\mathrm{nu}}_i\}_{i=1}^{n_{\mathrm{nu}}}$.
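The constrained problem for the Gaussian kernel model (17) can be solved in several ways. The sketch below uses a simple multiplicative (EM-style) update, swapped in here for illustration only; it is not the procedure of the cited paper or of the MATLAB implementation. It relies on the observation that the substitution $w_\ell = \bar{\psi}_{\mathrm{de},\ell}\, \theta_\ell$ turns the constraints $\bar{\psi}_{\mathrm{de}}^\top \theta = 1$, $\theta \ge \mathbf{0}_b$ into the probability simplex, so that the problem has the form of estimating mixture weights. Function names, data, and parameter values are hypothetical.

```python
import numpy as np


def gaussian_kernel(a, b, sigma):
    """Gram matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))


def kliep_objective(a, theta):
    """Empirical KLIEP objective: mean of log(psi(x_i^nu)^T theta)."""
    return float(np.mean(np.log(a @ theta)))


def kliep_em(x_nu, x_de, sigma, n_iter=200):
    """EM-style solver for kernel KLIEP (an illustrative substitute algorithm)."""
    a = gaussian_kernel(x_nu, x_nu, sigma)               # a[i, l] = K(x_i^nu, x_l^nu)
    b = gaussian_kernel(x_de, x_nu, sigma).mean(axis=0)  # psi-bar_de, entries > 0
    a_tilde = a / b                                      # columns rescaled for w = b * theta
    w = np.full(len(b), 1.0 / len(b))                    # uniform start on the simplex
    theta_init = w / b
    for _ in range(n_iter):
        resp = a_tilde * w
        resp /= resp.sum(axis=1, keepdims=True)          # responsibilities, rows sum to 1
        w = resp.mean(axis=0)                            # keeps sum(w) = 1 and w >= 0
    return w / b, theta_init, a, b


# Toy data (hypothetical): numerator N(0.5, 1), denominator N(0, 1).
rng = np.random.default_rng(0)
x_nu = rng.normal(0.5, 1.0, size=(100, 1))
x_de = rng.normal(0.0, 1.0, size=(200, 1))
theta, theta_init, a, b = kliep_em(x_nu, x_de, sigma=1.0)
```

Each update cannot decrease the objective, and the normalization and non-negativity constraints hold by construction at every iteration, which makes this variant convenient for a short demonstration.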

The KLIEP methods for linear/kernel models are referred to as linear KLIEP (L-KLIEP) and kernel KLIEP (K-KLIEP), respectively. A MATLAB® implementation of the K-KLIEP algorithm is available from

http://sugiyama-www.cs.titech.ac.jp/~sugi/software/KLIEP/


2.3.3  Log-Linear  Models 


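Another popular model choice would be the log-linear model (Tsuboi et al., 2009; Kanamori et al., 2010):

$$r(x; \theta, \theta_0) = \exp\left( \psi(x)^\top \theta + \theta_0 \right), \tag{18}$$

where $\theta_0$ is a normalization parameter. From the normalization constraint

$$\frac{1}{n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} r(x^{\mathrm{de}}_j; \theta, \theta_0) = 1,$$

$\theta_0$ is determined as

$$\theta_0 = -\log\left( \frac{1}{n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} \exp\left( \psi(x^{\mathrm{de}}_j)^\top \theta \right) \right).$$

Then the density-ratio model is expressed as

$$r(x; \theta) = \frac{\exp\left( \psi(x)^\top \theta \right)}{\frac{1}{n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} \exp\left( \psi(x^{\mathrm{de}}_j)^\top \theta \right)}.$$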

By definition, outputs of the log-linear model $r(x; \theta)$ are non-negative for all $x$. Thus, we do not need the non-negativity constraint on the parameter. Then the KLIEP optimization criterion is expressed as

$$\max_{\theta \in \mathbb{R}^b} \left[ \bar{\psi}_{\mathrm{nu}}^\top \theta - \log\left( \frac{1}{n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} \exp\left( \psi(x^{\mathrm{de}}_j)^\top \theta \right) \right) \right],$$

where $\bar{\psi}_{\mathrm{nu}} := \frac{1}{n_{\mathrm{nu}}} \sum_{i=1}^{n_{\mathrm{nu}}} \psi(x^{\mathrm{nu}}_i)$. This is an unconstrained convex optimization problem, so the global optimal solution can be obtained by, e.g., the gradient method or (quasi-)Newton methods. Since the numerator samples $\{x^{\mathrm{nu}}_i\}_{i=1}^{n_{\mathrm{nu}}}$ appear only in terms of the basis-transformed mean $\bar{\psi}_{\mathrm{nu}}$, KLIEP for log-linear models is computationally efficient even when the number $n_{\mathrm{nu}}$ of numerator samples is very large (cf. KLIEP for linear/kernel models is computationally efficient when $n_{\mathrm{de}}$ is very large; see Section 2.3.2).
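The log-linear criterion above lends itself to a short gradient-ascent sketch. The gradient is the numerator mean $\bar{\psi}_{\mathrm{nu}}$ minus a softmax-weighted mean of the denominator basis vectors. The basis $\psi(x) = x$, the fixed step size, the iteration count, and the toy data below are assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np


def ll_objective(psi_nu, psi_de, theta):
    """psi-bar_nu^T theta - log((1/n_de) sum_j exp(psi(x_j^de)^T theta))."""
    s = psi_de @ theta
    log_norm = s.max() + np.log(np.mean(np.exp(s - s.max())))  # stable log-mean-exp
    return float(psi_nu.mean(axis=0) @ theta - log_norm)


def ll_kliep(psi_nu, psi_de, step=0.1, n_iter=500):
    """Plain gradient ascent on the log-linear KLIEP criterion."""
    psi_nu_mean = psi_nu.mean(axis=0)
    theta = np.zeros(psi_nu.shape[1])
    for _ in range(n_iter):
        s = psi_de @ theta
        w = np.exp(s - s.max())
        w /= w.sum()                          # softmax weights over denominator samples
        theta += step * (psi_nu_mean - w @ psi_de)
    return theta


def ll_ratio(psi_new, psi_de, theta):
    """Normalized log-linear density-ratio model evaluated at new points."""
    s = psi_de @ theta
    log_norm = s.max() + np.log(np.mean(np.exp(s - s.max())))
    return np.exp(psi_new @ theta - log_norm)


# Toy data (hypothetical): numerator N(0.5, 1), denominator N(0, 1),
# for which the true log-ratio is 0.5 x - 0.125.
rng = np.random.default_rng(0)
psi_nu = rng.normal(0.5, 1.0, size=(2000, 1))
psi_de = rng.normal(0.0, 1.0, size=(2000, 1))
theta = ll_kliep(psi_nu, psi_de)
```

Note that the numerator samples enter the loop only through `psi_nu_mean`, which mirrors the efficiency remark made in the text.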

The KLIEP method for log-linear models is called log-linear KLIEP (LL-KLIEP).

2.3.4  Gaussian  Mixture  Models


In the Gaussian kernel model (17), the Gaussian shape is spherical and its width is controlled by a single width parameter $\sigma$. It is possible to use correlated Gaussian kernels, but choosing the covariance matrix via cross-validation would be computationally intractable.

Another option is to also estimate the covariance matrix directly from data. For this purpose, the Gaussian mixture model comes in handy (Yamada and Sugiyama, 2009):

$$r(x; \{\theta_k, \mu_k, \Sigma_k\}_{k=1}^{c}) = \sum_{k=1}^{c} \theta_k K(x; \mu_k, \Sigma_k), \tag{19}$$

where $c$ is the number of mixing components, $\{\theta_k\}_{k=1}^{c}$ are mixing coefficients, $\{\mu_k\}_{k=1}^{c}$ are means of Gaussian functions, $\{\Sigma_k\}_{k=1}^{c}$ are covariance matrices of Gaussian functions, and $K(x; \mu, \Sigma)$ is the Gaussian kernel with mean $\mu$ and covariance matrix $\Sigma$:

$$K(x; \mu, \Sigma) := \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right). \tag{20}$$

Note that $\Sigma$ should be positive definite, i.e., all the eigenvalues of $\Sigma$ should be strictly positive.

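For the Gaussian mixture model (19), the KLIEP optimization problem is expressed as

$$\max_{\{\theta_k, \mu_k, \Sigma_k\}_{k=1}^{c}} \; \frac{1}{n_{\mathrm{nu}}} \sum_{i=1}^{n_{\mathrm{nu}}} \log\left( \sum_{k=1}^{c} \theta_k K(x^{\mathrm{nu}}_i; \mu_k, \Sigma_k) \right)$$

$$\text{s.t.} \quad \frac{1}{n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} \sum_{k=1}^{c} \theta_k K(x^{\mathrm{de}}_j; \mu_k, \Sigma_k) = 1,$$

$$\theta_k \ge 0 \;\; \text{and} \;\; \Sigma_k \succ O \;\; \text{for } k = 1, \ldots, c,$$

where $\Sigma_k \succ O$ means that $\Sigma_k$ is positive definite.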

The above optimization problem is non-convex, and there is no known method for obtaining the global optimal solution. In practice, a local optimal solution may be numerically obtained by, e.g., a fixed-point method.

The  KLIEP  method  for  Gaussian  mixture  models  is  called  Gaussian-mixture  KLIEP 
(GM-KLIEP). 


2.3.5  Probabilistic  PCA  Mixture  Models 

The Gaussian mixture model explained above would be more flexible than linear/kernel/log-linear models and suitable for approximating correlated density-ratio functions. However, when the target density-ratio function is (locally) rank-deficient, its behavior could be unstable since inverse covariance matrices are included in the Gaussian function (see Eq.(20)). To cope with this problem, the use of a mixture of probabilistic principal component analyzers (PPCA; Tipping and Bishop, 1999) was proposed for density-ratio estimation (Yamada et al., 2010).

The PPCA mixture model is defined as

$$r(x; \{\theta_k, \mu_k, \sigma_k^2, W_k\}_{k=1}^{c}) = \sum_{k=1}^{c} \theta_k K(x; \mu_k, \sigma_k^2, W_k),$$

where $c$ is the number of mixing components and $\{\theta_k\}_{k=1}^{c}$ are mixing coefficients. $K(x; \mu, \sigma^2, W)$ is a PPCA model defined by

$$K(x; \mu, \sigma^2, W) = (2\pi\sigma^2)^{-\frac{d}{2}} \det(C)^{-\frac{1}{2}} \exp\left( -\frac{1}{2} (x - \mu)^\top C^{-1} (x - \mu) \right),$$

where 'det' denotes the determinant, $\mu$ is the mean of the Gaussian function, $\sigma^2$ is the variance of the Gaussian function, $W$ is a $d \times m$ 'projection' matrix onto an $m$-dimensional latent space (where $m \le d$), and $C = W W^\top + \sigma^2 I_d$.

Then the KLIEP optimization criterion is expressed as

$$\max_{\{\theta_k, \mu_k, \sigma_k^2, W_k\}_{k=1}^{c}} \; \frac{1}{n_{\mathrm{nu}}} \sum_{i=1}^{n_{\mathrm{nu}}} \log\left( \sum_{k=1}^{c} \theta_k K(x^{\mathrm{nu}}_i; \mu_k, \sigma_k^2, W_k) \right)$$

$$\text{s.t.} \quad \frac{1}{n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} \sum_{k=1}^{c} \theta_k K(x^{\mathrm{de}}_j; \mu_k, \sigma_k^2, W_k) = 1,$$

$$\theta_k \ge 0 \;\; \text{for } k = 1, \ldots, c.$$


The above optimization is non-convex, so a local optimal solution may be found by some algorithm in practice. When the dimensionality of the latent space, $m$, is equal to the entire dimensionality $d$, PPCA models are reduced to ordinary Gaussian models. Thus, PPCA models can be regarded as an extension of Gaussian models to (locally) rank-deficient data.

The KLIEP method for PPCA mixture models is called PPCA-mixture KLIEP (PM-KLIEP).

2.3.6  Remarks 

Density-ratio  estimation  by  density  matching  under  the  KL  divergence  allows  one  to 
avoid  density  estimation  when  estimating  density-ratios  (Section  2.3.1).  Furthermore, 
cross-validation  with  respect  to  the  KL  divergence  is  available  for  model  selection. 

The  method,  called  the  KL  importance  estimation  procedure  (KLIEP),  is  applicable 
to  a  variety  of  models  such  as  linear  models,  kernel  models,  log-linear  models,  Gaussian 
mixture  models,  and  probabilistic  principal-component-analyzer  mixture  models. 


2.4  Density-Ratio  Fitting

Here, we describe a framework of density-ratio estimation by least-squares density-ratio fitting (Kanamori et al., 2009).

2.4.1  Basic  Framework


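The model $r(x)$ of the true density-ratio function $r^*(x) = p^*_{\mathrm{nu}}(x) / p^*_{\mathrm{de}}(x)$ is learned so that the following squared error $\mathrm{SQ}'$ is minimized:

$$\mathrm{SQ}'(r) := \frac{1}{2} \int \left( r(x) - r^*(x) \right)^2 p^*_{\mathrm{de}}(x)\, \mathrm{d}x = \frac{1}{2} \int r(x)^2 p^*_{\mathrm{de}}(x)\, \mathrm{d}x - \int r(x)\, p^*_{\mathrm{nu}}(x)\, \mathrm{d}x + \frac{1}{2} \int r^*(x)\, p^*_{\mathrm{nu}}(x)\, \mathrm{d}x,$$

where the last term is a constant and therefore can be safely ignored. Let us denote the first two terms by $\mathrm{SQ}$:

$$\mathrm{SQ}(r) := \frac{1}{2} \int r(x)^2 p^*_{\mathrm{de}}(x)\, \mathrm{d}x - \int r(x)\, p^*_{\mathrm{nu}}(x)\, \mathrm{d}x.$$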

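Approximating the expectations in $\mathrm{SQ}$ by empirical averages, we obtain the following optimization problem:

$$\min_{r} \left[ \frac{1}{2 n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} r(x^{\mathrm{de}}_j)^2 - \frac{1}{n_{\mathrm{nu}}} \sum_{i=1}^{n_{\mathrm{nu}}} r(x^{\mathrm{nu}}_i) \right]. \tag{21}$$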


We refer to this formulation as least-squares importance fitting (LSIF). Possible hyper-parameters (such as basis parameters and regularization parameters) can be optimized by cross-validation with respect to the $\mathrm{SQ}$ criterion (Kanamori et al., 2009).

Below, two implementations of LSIF for the following linear/kernel models are described:

$$r(x) = \sum_{\ell=1}^{b} \theta_\ell \psi_\ell(x) = \psi(x)^\top \theta,$$

where $\psi(x) : \mathbb{R}^d \to \mathbb{R}^b$ is a non-negative basis function vector, and $\theta$ ($\in \mathbb{R}^b$) is a parameter vector. Since this model has the same form as that used in KLIEP for linear/kernel models (Section 2.3.2), we may use the same basis design idea described there.

For the above linear/kernel models, Eq.(21) is expressed as

$$\min_{\theta \in \mathbb{R}^b} \left[ \frac{1}{2} \theta^\top \widehat{H} \theta - \widehat{h}^\top \theta \right],$$

where

$$\widehat{H} := \frac{1}{n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} \psi(x^{\mathrm{de}}_j)\, \psi(x^{\mathrm{de}}_j)^\top \quad \text{and} \quad \widehat{h} := \frac{1}{n_{\mathrm{nu}}} \sum_{i=1}^{n_{\mathrm{nu}}} \psi(x^{\mathrm{nu}}_i). \tag{22}$$

2.4.2  Implementation  with  Non-Negativity  Constraint


Here, we describe an implementation of LSIF with the non-negativity constraint.

Let us impose the non-negativity constraint $\theta \ge \mathbf{0}_b$ since the density-ratio function is non-negative by definition. Let us further add the following regularization term to the objective function:

$$\mathbf{1}_b^\top \theta = \sum_{\ell=1}^{b} \theta_\ell.$$

The term $\mathbf{1}_b^\top \theta$ works as the $\ell_1$-regularizer if it is combined with the non-negativity constraint. Then the optimization problem is expressed as follows:

$$\min_{\theta \in \mathbb{R}^b} \left[ \frac{1}{2} \theta^\top \widehat{H} \theta - \widehat{h}^\top \theta + \lambda \mathbf{1}_b^\top \theta \right] \quad \text{s.t.} \quad \theta \ge \mathbf{0}_b,$$

where $\lambda$ ($\ge 0$) is the regularization parameter. We refer to this method as constrained LSIF (cLSIF; Kanamori et al., 2009). The cLSIF optimization problem is a convex quadratic program, so the unique global optimal solution may be computed by standard optimization software.

We can also use the $\ell_2$-regularizer $\theta^\top \theta$, instead of the $\ell_1$-regularizer $\mathbf{1}_b^\top \theta$, without changing the computational property (i.e., the optimization problem is still a convex quadratic program). However, using the $\ell_1$-regularizer would be more advantageous since the solution tends to be sparse, i.e., many parameters take exactly zero (Williams, 1995; Tibshirani, 1996; Chen et al., 1998). Furthermore, as shown in Kanamori et al. (2009), the use of the $\ell_1$-regularizer allows one to compute the entire regularization path efficiently (Best, 1982; Efron et al., 2004; Hastie et al., 2004), which greatly reduces the computational cost in the model selection phase.

An R implementation of cLSIF is available from

http://www.math.cm.is.nagoya-u.ac.jp/~kanamori/software/LSIF/


2.4.3  Implementation  without  Non-Negativity  Constraint


Here, we describe another implementation of LSIF without the non-negativity constraint called unconstrained LSIF (uLSIF).

Without the non-negativity constraint, the linear regularizer $\mathbf{1}_b^\top \theta$ used in cLSIF does not work as a regularizer. For this reason, a quadratic regularizer $\theta^\top \theta$ is adopted here. Then we have the following optimization problem:

$$\min_{\theta \in \mathbb{R}^b} \left[ \frac{1}{2} \theta^\top \widehat{H} \theta - \widehat{h}^\top \theta + \frac{\lambda}{2} \theta^\top \theta \right]. \tag{23}$$

Eq.(23) is an unconstrained convex quadratic program, and the solution can be computed analytically by solving the following system of linear equations:

$$\left( \widehat{H} + \lambda I_b \right) \theta = \widehat{h},$$

where $I_b$ is the $b$-dimensional identity matrix. The solution $\widehat{\theta}$ of the above equation is given by

$$\widehat{\theta} = \left( \widehat{H} + \lambda I_b \right)^{-1} \widehat{h}.$$

Since the non-negativity constraint $\theta \ge \mathbf{0}_b$ was dropped, some of the obtained parameters could be negative. To compensate for this approximation error, the solution may be modified as follows (Kanamori et al., 2012):

$$\widehat{r}(x) = \max\left( 0, \psi(x)^\top \widehat{\theta} \right).$$

This is the solution of the approximation method called unconstrained LSIF (uLSIF; Kanamori et al., 2009). An advantage of uLSIF is that the solution can be analytically computed just by solving a system of linear equations. Therefore, its computation is stable when $\lambda$ is not too small.
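The uLSIF steps above (build $\widehat{H}$ and $\widehat{h}$ from Eq.(22), solve the regularized linear system, clip at zero) can be sketched as follows. This is an illustrative sketch, not the MATLAB or R implementation referred to in the text: Gaussian kernels centered at a subset of the numerator samples are used as basis functions, following the basis design idea of Section 2.3.2, and the number of centers, the kernel width, the value of $\lambda$, and the toy data are assumptions. Model selection by cross-validation is omitted.

```python
import numpy as np


def gaussian_kernel(a, b, sigma):
    """Gram matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))


def ulsif(x_nu, x_de, sigma, lam, n_centers=100):
    """Return a density-ratio estimator max(0, psi(x)^T theta) and theta."""
    centers = x_nu[:n_centers]                      # kernel centers (assumption)
    psi_de = gaussian_kernel(x_de, centers, sigma)
    psi_nu = gaussian_kernel(x_nu, centers, sigma)
    h_mat = psi_de.T @ psi_de / len(x_de)           # H-hat in Eq.(22)
    h_vec = psi_nu.mean(axis=0)                     # h-hat in Eq.(22)
    theta = np.linalg.solve(h_mat + lam * np.eye(len(centers)), h_vec)

    def ratio(x_new):
        return np.maximum(0.0, gaussian_kernel(x_new, centers, sigma) @ theta)

    return ratio, theta


# Toy data (hypothetical): numerator N(0, 0.5^2), denominator N(0, 1),
# so the true ratio peaks at x = 0 and is close to zero at x = 2.
rng = np.random.default_rng(0)
x_nu = rng.normal(0.0, 0.5, size=(500, 1))
x_de = rng.normal(0.0, 1.0, size=(500, 1))
ratio, theta = ulsif(x_nu, x_de, sigma=0.5, lam=0.1)
values = ratio(np.array([[0.0], [2.0]]))
```

The only numerical work is one linear solve of size `n_centers`, which reflects the computational advantage discussed in the text.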

A practically important advantage of uLSIF over cLSIF is that the score of leave-one-out cross-validation (LOOCV) can be computed analytically (Kanamori et al., 2009). Thanks to this property, the computational complexity for performing LOOCV is of the same order as that of computing a single solution.

A MATLAB® implementation of uLSIF is available from

http://sugiyama-www.cs.titech.ac.jp/~sugi/software/uLSIF/

and an R implementation of uLSIF is available from

http://www.math.cm.is.nagoya-u.ac.jp/~kanamori/software/LSIF/

2.4.4  Remarks 

One can successfully avoid density estimation by least-squares density-ratio fitting. The least-squares methods for linear/kernel models are computationally more advantageous than alternative approaches such as moment matching (Section 2.1), probabilistic classification (Section 2.2), and density matching (Section 2.3). Indeed, the constrained method (cLSIF) for the $\ell_1$-regularizer is equipped with a regularization path tracking algorithm. Furthermore, the unconstrained method (uLSIF) allows one to compute the density-ratio estimator analytically; the leave-one-out cross-validation score can also be computed in a closed form. Thus, the overall computation of uLSIF including model selection is highly efficient.

The fact that uLSIF has an analytic-form solution is actually very useful beyond its computational efficiency. When one wants to optimize some criterion defined using a density-ratio estimate (e.g., mutual information; see Cover and Thomas, 2006), the analytic-form solution of uLSIF allows one to compute the derivative of the target criterion analytically. Then one can develop, e.g., gradient-based and (quasi-)Newton algorithms for optimization. This property can be successfully utilized, e.g., in identifying the central subspace in sufficient dimension reduction (Suzuki and Sugiyama, 2010), finding independent components in independent component analysis (Suzuki and Sugiyama, 2011), performing dependence-minimizing regression in causality learning (Yamada and Sugiyama, 2010), and identifying the hetero-distributional subspace in direct density-ratio estimation with dimensionality reduction (Sugiyama et al., 2011b).

3  Unified Framework by Density-Ratio Matching

As  reviewed  in  the  previous  section,  various  density-ratio  estimation  methods  have  been 
developed  so  far.  In  this  section,  we  propose  a  new  framework  of  density-ratio  estimation 
by  density-ratio  matching  under  the  Bregman  divergence  (Bregman,  1967),  which  includes 
various  useful  divergences  (Banerjee  et  al.,  2005;  Stummer,  2007).  This  framework  is  a 
natural  extension  of  the  least-squares  approach  described  in  Section  2.4,  and  includes  the 
existing  approaches  reviewed  in  the  previous  section  as  special  cases  (Section  3.2).  Then 
we  provide  interpretation  of  density-ratio  matching  from  two  different  views  in  Section  3.3. 
Finally,  we  give  a  new  instance  of  density-ratio  matching  based  on  the  power  divergence 
in  Section  3.4. 

3.1  Basic  Framework 

A basic idea of density-ratio matching is to directly fit a density-ratio model $r(x)$ to the
true density-ratio function $r^*(x)$ under some divergence. At a glance, this density-ratio
matching problem is equivalent to the regression problem, which is aimed at estimating a
real-valued function. However, density-ratio matching is essentially different from
regression since samples of the true density-ratio function are not available. Here, we employ the
Bregman (BR) divergence for measuring the discrepancy between the true density-ratio
function and the density-ratio model.

The BR divergence is an extension of the Euclidean distance to a class of divergences
that share similar properties. Let $f$ be a differentiable and strictly convex function. Then
the BR divergence associated with $f$ from $t^*$ to $t$ is defined as

$$\mathrm{BR}'_f(t^* \| t) := f(t^*) - f(t) - \partial f(t)(t^* - t),$$

where $\partial f$ is the derivative of $f$. Note that

$$f(t) + \partial f(t)(t^* - t)$$

is the value of the first-order Taylor expansion of $f$ around $t$ evaluated at $t^*$. Thus, the
BR divergence evaluates the difference between the value of $f$ at point $t^*$ and its linear
extrapolation from $t$ (see Figure 1). $\mathrm{BR}'_f(t^* \| t)$ is a convex function with respect to $t^*$,
but not necessarily convex with respect to $t$.
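As a small illustration of the definition above, the following sketch evaluates the pointwise BR divergence for two convex functions that appear later in the text: $f(t) = (t-1)^2/2$ (Section 3.2.1) and $f(t) = t\log t - t$ (Section 3.2.4). The function names and test points are hypothetical choices made for this example only.

```python
import numpy as np

def bregman(f, df, t_star, t):
    """Pointwise BR divergence: f(t*) - f(t) - df(t) * (t* - t)."""
    return f(t_star) - f(t) - df(t) * (t_star - t)

def f_sq(t):
    return 0.5 * (t - 1.0) ** 2

def df_sq(t):
    return t - 1.0

def f_ukl(t):
    return t * np.log(t) - t

def df_ukl(t):
    return np.log(t)

# Arbitrary positive evaluation points (illustrative only).
t_star = np.array([0.5, 1.0, 2.0, 3.0])
t = np.array([1.5, 1.0, 0.5, 3.5])

br_sq = bregman(f_sq, df_sq, t_star, t)     # equals 0.5 * (t_star - t)**2
br_ukl = bregman(f_ukl, df_ukl, t_star, t)  # equals t* log(t*/t) - t* + t
print(br_sq)
print(br_ukl)
```

Both divergences are non-negative and vanish exactly where $t^* = t$, as expected from strict convexity of $f$.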

Here the discrepancy from the true density-ratio function $r^*$ to a density-ratio model
$r$ is measured using the BR divergence as

$$\mathrm{BR}'_f(r^* \| r) := \int p^*_{\mathrm{de}}(x) \Big( f(r^*(x)) - f(r(x)) - \partial f(r(x)) \big(r^*(x) - r(x)\big) \Big) dx. \tag{24}$$

A motivation for this choice is that the BR divergence allows one to directly obtain an
empirical approximation for any $f$. Indeed, let us first extract a relevant part of $\mathrm{BR}'_f(r^* \| r)$
as

$$\mathrm{BR}'_f(r^* \| r) = \mathrm{BR}_f(r) + C,$$

where $C := \int p^*_{\mathrm{de}}(x) f(r^*(x)) dx$ is a constant independent of $r$, and

$$\mathrm{BR}_f(r) := \int p^*_{\mathrm{de}}(x) \Big( \partial f(r(x)) r(x) - f(r(x)) \Big) dx - \int p^*_{\mathrm{nu}}(x) \partial f(r(x)) dx. \tag{25}$$

Then an empirical approximation $\widehat{\mathrm{BR}}_f(r)$ of $\mathrm{BR}_f(r)$ is given by

$$\widehat{\mathrm{BR}}_f(r) := \frac{1}{n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} \Big( \partial f(r(x_j^{\mathrm{de}})) r(x_j^{\mathrm{de}}) - f(r(x_j^{\mathrm{de}})) \Big) - \frac{1}{n_{\mathrm{nu}}} \sum_{i=1}^{n_{\mathrm{nu}}} \partial f(r(x_i^{\mathrm{nu}})). \tag{26}$$

This immediately gives the following optimization criterion:

$$\min_{r} \widehat{\mathrm{BR}}_f(r),$$

where $r$ is searched within some class of functions.

3.2  Existing Methods as Density-Ratio Matching

Here, we show that various density-ratio estimation methods reviewed in the previous
section can be accommodated in the density-ratio matching framework (see Table 1).

3.2.1  Least-Squares  Importance  Fitting 

Here, we show that the least-squares importance fitting (LSIF) approach introduced in
Section 2.4.1 is an instance of density-ratio matching. More specifically, there exists a
BR divergence such that the optimization problem of density-ratio matching is reduced
to that of LSIF.

Table 1: Summary of density-ratio estimation methods. In the table, ‘LSIF’, ‘KMM’, ‘LR’,
and ‘KLIEP’ stand for ‘least-squares importance fitting’, ‘kernel mean matching’, ‘logistic
regression’, and ‘Kullback-Leibler Importance Estimation Procedure’, respectively.

Method (Section)   f(t)                                   Model selection         Optimization
LSIF (3.2.1)       $(t-1)^2/2$                            Available               Analytic
KMM (3.2.2)        $(t-1)^2/2$                            Partially unavailable   Analytic
LR (3.2.3)         $t\log t - (1+t)\log(1+t)$             Available               Convex
KLIEP (3.2.4)      $t\log t - t$                          Available               Convex
Robust (3.4)       $(t^{1+\alpha} - t)/\alpha$, $\alpha > 0$   Available          Convex ($0 < \alpha \le 1$), non-convex ($\alpha > 1$)

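When

$$f(t) = \frac{1}{2}(t - 1)^2,$$

BR (24) is reduced to the squared (SQ) distance:

$$\mathrm{SQ}'(t^* \| t) := \frac{1}{2}(t^* - t)^2.$$

Following Eqs.(25) and (26), let us denote SQ without an irrelevant constant term by
$\mathrm{SQ}(r)$ and its empirical approximation by $\widehat{\mathrm{SQ}}(r)$, respectively:

$$\mathrm{SQ}(r) := \frac{1}{2} \int p^*_{\mathrm{de}}(x) r(x)^2 dx - \int p^*_{\mathrm{nu}}(x) r(x) dx,$$

$$\widehat{\mathrm{SQ}}(r) := \frac{1}{2 n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} r(x_j^{\mathrm{de}})^2 - \frac{1}{n_{\mathrm{nu}}} \sum_{i=1}^{n_{\mathrm{nu}}} r(x_i^{\mathrm{nu}}).$$

This agrees with the LSIF formulation given in Section 2.4.1.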
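The reduction above can be checked numerically. The sketch below evaluates the generic empirical criterion (26) with $f(t) = (t-1)^2/2$ and compares it with $\widehat{\mathrm{SQ}}(r)$; the two differ only by the constant $1/2$ that the text drops as irrelevant. The model values `r_de` and `r_nu` are arbitrary positive numbers standing in for $r(x_j^{\mathrm{de}})$ and $r(x_i^{\mathrm{nu}})$, and all function names are hypothetical.

```python
import numpy as np

def br_empirical(f, df, r_de, r_nu):
    """Empirical BR criterion, Eq.(26): mean_de[df(r) r - f(r)] - mean_nu[df(r)]."""
    return np.mean(df(r_de) * r_de - f(r_de)) - np.mean(df(r_nu))

def sq_empirical(r_de, r_nu):
    """Empirical SQ criterion: (1/2) mean_de[r^2] - mean_nu[r]."""
    return 0.5 * np.mean(r_de ** 2) - np.mean(r_nu)

def f_sq(t):
    return 0.5 * (t - 1.0) ** 2

def df_sq(t):
    return t - 1.0

rng = np.random.default_rng(0)
r_de = rng.uniform(0.1, 3.0, size=200)  # stand-in for model values on denominator samples
r_nu = rng.uniform(0.1, 3.0, size=150)  # stand-in for model values on numerator samples

gap = br_empirical(f_sq, df_sq, r_de, r_nu) - sq_empirical(r_de, r_nu)
print(gap)  # the dropped constant, 1/2 up to rounding
```

Because the gap does not depend on $r$, minimizing either quantity over a model class gives the same solution.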

3.2.2  Kernel  Mean  Matching 

Here, we show that the solution of the moment matching method, kernel mean
matching (KMM) introduced in Section 2.1, actually agrees with that of unconstrained LSIF
(uLSIF; see Section 2.4.3) for specific kernel models. Since uLSIF was shown to be an
instance of density-ratio matching in Section 3.2.1, the KMM solution can also be obtained
in the density-ratio matching framework.

Let us consider the following kernel density-ratio model:

$$r(x) = \sum_{\ell=1}^{n_{\mathrm{de}}} \theta_\ell K(x, x_\ell^{\mathrm{de}}), \tag{27}$$

where $K(x, x')$ is a universal reproducing kernel (Steinwart, 2001) such as the Gaussian
kernel (7). Note that uLSIF and KLIEP use the numerator samples $\{x_i^{\mathrm{nu}}\}_{i=1}^{n_{\mathrm{nu}}}$ as Gaussian
centers, while the model (27) adopts the denominator samples $\{x_j^{\mathrm{de}}\}_{j=1}^{n_{\mathrm{de}}}$ as Gaussian
centers. For the density-ratio model (27), the matrix $\widehat{H}$ and the vector $\widehat{h}$ defined by Eq.(22)
are expressed as

$$\widehat{H} = \frac{1}{n_{\mathrm{de}}} K_{\mathrm{de,de}}^2 \quad \text{and} \quad \widehat{h} = \frac{1}{n_{\mathrm{nu}}} K_{\mathrm{de,nu}} 1_{n_{\mathrm{nu}}},$$

where $K_{\mathrm{de,de}}$ and $K_{\mathrm{de,nu}}$ are defined in Eq.(8). Then the (unregularized) uLSIF solution
(see Section 2.4.3 for details) is expressed as

$$\widehat{\theta}_{\mathrm{uLSIF}} = \widehat{H}^{-1} \widehat{h} = \frac{n_{\mathrm{de}}}{n_{\mathrm{nu}}} K_{\mathrm{de,de}}^{-2} K_{\mathrm{de,nu}} 1_{n_{\mathrm{nu}}}. \tag{28}$$

On the other hand, let us consider an inductive variant of KMM for the kernel model
(27) (see Section 2.1.2). For the density-ratio model (27), the design matrix $\Psi_{\mathrm{de}}$ defined by
Eq.(5) agrees with $K_{\mathrm{de,de}}$. Then the KMM solution is given as follows (see Section 2.1.2):

$$\widehat{\theta}_{\mathrm{KMM}} = \frac{n_{\mathrm{de}}}{n_{\mathrm{nu}}} \big( \Psi_{\mathrm{de}}^\top K_{\mathrm{de,de}} \Psi_{\mathrm{de}} \big)^{-1} \Psi_{\mathrm{de}}^\top K_{\mathrm{de,nu}} 1_{n_{\mathrm{nu}}} = \widehat{\theta}_{\mathrm{uLSIF}}.$$
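The agreement of the two solutions can be verified numerically on a toy problem. The sketch below assumes a one-dimensional Gaussian kernel of the form $\exp(-(x-x')^2/(2\sigma^2))$ (the exact form of kernel (7) is not shown in this section), uses a handful of well-separated denominator points so that $K_{\mathrm{de,de}}$ is well conditioned, and follows the two expressions given above. Sample values, the kernel width, and function names are illustrative assumptions.

```python
import numpy as np

def gauss_kernel(a, b, sigma):
    """Gaussian kernel matrix between 1-D point sets a and b (assumed form)."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * sigma ** 2))

def ulsif_theta(K_dede, K_denu):
    """Unregularized uLSIF solution H^{-1} h for the model centered at denominator samples."""
    n_de, n_nu = K_denu.shape
    H = K_dede @ K_dede / n_de
    h = K_denu @ np.ones(n_nu) / n_nu
    return np.linalg.solve(H, h)

def kmm_theta(K_dede, K_denu):
    """Inductive KMM solution with design matrix Psi_de = K_dede."""
    n_de, n_nu = K_denu.shape
    Psi = K_dede
    A = Psi.T @ K_dede @ Psi
    b = Psi.T @ K_denu @ np.ones(n_nu)
    return (n_de / n_nu) * np.linalg.solve(A, b)

x_de = np.linspace(-2.0, 2.0, 6)           # denominator samples (kernel centers)
x_nu = np.array([-0.5, 0.0, 0.3, 0.8])     # numerator samples
sigma = 0.3                                # hypothetical kernel width

K_dede = gauss_kernel(x_de, x_de, sigma)
K_denu = gauss_kernel(x_de, x_nu, sigma)

theta_ulsif = ulsif_theta(K_dede, K_denu)
theta_kmm = kmm_theta(K_dede, K_denu)
print(np.max(np.abs(theta_ulsif - theta_kmm)))  # close to zero
```

With many or closely spaced samples the kernel matrix becomes ill conditioned, and a regularized solve would be needed in practice; this sketch deliberately avoids that regime.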


3.2.3  Logistic  Regression 

Here,  we  show  that  the  logistic  regression  approach  introduced  in  Section  2.2.2  is  an 
instance  of  density-ratio  matching.  More  specifically,  there  exists  a  BR  divergence  such 
that  the  optimization  problem  of  density-ratio  matching  is  reduced  to  that  of  the  logistic 
regression  approach. 

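When

$$f(t) = t \log t - (1 + t) \log(1 + t),$$

BR (24) is reduced to the binary Kullback-Leibler (BKL) divergence:

$$\mathrm{BKL}'(t^* \| t) := (1 + t^*) \log \frac{1 + t}{1 + t^*} + t^* \log \frac{t^*}{t}.$$

The name ‘BKL’ comes from the fact that $\mathrm{BKL}'(t^* \| t)$ is expressed as

$$\mathrm{BKL}'(t^* \| t) = (1 + t^*) \, \mathrm{KL}_{\mathrm{bin}}\!\left( \frac{t^*}{1 + t^*}, \frac{t}{1 + t} \right),$$

where $\mathrm{KL}_{\mathrm{bin}}$ is the KL divergence for binary random variables defined as

$$\mathrm{KL}_{\mathrm{bin}}(p, q) := p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}$$

for $0 < p, q < 1$. Thus, $\mathrm{BKL}'$ agrees with $\mathrm{KL}_{\mathrm{bin}}$ up to the factor $(1 + t^*)$.

Following Eqs.(25) and (26), let us denote BKL without an irrelevant constant term
by $\mathrm{BKL}(r)$ and its empirical approximation by $\widehat{\mathrm{BKL}}(r)$, respectively:

$$\mathrm{BKL}(r) := - \int p^*_{\mathrm{de}}(x) \log \frac{1}{1 + r(x)} dx - \int p^*_{\mathrm{nu}}(x) \log \frac{r(x)}{1 + r(x)} dx,$$

$$\widehat{\mathrm{BKL}}(r) := - \frac{1}{n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} \log \frac{1}{1 + r(x_j^{\mathrm{de}})} - \frac{1}{n_{\mathrm{nu}}} \sum_{i=1}^{n_{\mathrm{nu}}} \log \frac{r(x_i^{\mathrm{nu}})}{1 + r(x_i^{\mathrm{nu}})}. \tag{29}$$

Eq.(29) is a generalized expression of logistic regression (Qin, 1998). Indeed, when $n_{\mathrm{de}} = n_{\mathrm{nu}}$,
the ordinary logistic regression formulation (11) can be obtained from Eq.(29) (up
to a regularizer) if the log-linear density-ratio model (18) without the constant term $\theta_0$ is
used.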
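To make the connection with logistic regression concrete, the sketch below evaluates Eq.(29) for a log-linear model $r(x) = \exp(\theta^\top x)$ and compares it with a logistic loss in which denominator samples carry label $-1$ and numerator samples carry label $+1$. The exact form of formulation (11) is not reproduced in this section, so the loss written here is an assumed standard form; data and parameter values are arbitrary.

```python
import numpy as np

def bkl_empirical(r_de, r_nu):
    """Eq.(29): -mean_de log(1/(1+r)) - mean_nu log(r/(1+r))."""
    return np.mean(np.log1p(r_de)) - np.mean(np.log(r_nu / (1.0 + r_nu)))

def logistic_loss(theta, x_de, x_nu):
    """Per-class mean logistic losses with score s(x) = theta^T x.

    Denominator samples are treated as label -1, numerator samples as label +1.
    """
    s_de = x_de @ theta
    s_nu = x_nu @ theta
    return np.mean(np.logaddexp(0.0, s_de)) + np.mean(np.logaddexp(0.0, -s_nu))

rng = np.random.default_rng(0)
x_de = rng.normal(0.0, 1.0, size=(40, 2))
x_nu = rng.normal(0.5, 1.0, size=(40, 2))
theta = np.array([0.3, -0.7])   # arbitrary parameter vector

r_de = np.exp(x_de @ theta)     # log-linear ratio model without constant term
r_nu = np.exp(x_nu @ theta)

print(bkl_empirical(r_de, r_nu), logistic_loss(theta, x_de, x_nu))
```

When $n_{\mathrm{de}} = n_{\mathrm{nu}}$, the sum of the two class-wise means is proportional to the usual pooled average logistic loss, which is why the equal-sample-size condition appears in the text.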


3.2.4  Kullback-Leibler  Importance  Estimation  Procedure 

Here, we show that the KL importance estimation procedure (KLIEP) introduced in
Section 2.3.1 is an instance of density-ratio matching. More specifically, there exists a BR
divergence such that the optimization problem of density-ratio matching is reduced to
that of the KLIEP approach.

When

$$f(t) = t \log t - t,$$

BR (24) is reduced to the unnormalized Kullback-Leibler (UKL) divergence:

$$\mathrm{UKL}'(t^* \| t) := t^* \log \frac{t^*}{t} - t^* + t.$$

Following Eqs.(25) and (26), let us denote UKL without an irrelevant constant term by
$\mathrm{UKL}(r)$ and its empirical approximation by $\widehat{\mathrm{UKL}}(r)$, respectively:

$$\mathrm{UKL}(r) := \int p^*_{\mathrm{de}}(x) r(x) dx - \int p^*_{\mathrm{nu}}(x) \log r(x) dx, \tag{30}$$

$$\widehat{\mathrm{UKL}}(r) := \frac{1}{n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} r(x_j^{\mathrm{de}}) - \frac{1}{n_{\mathrm{nu}}} \sum_{i=1}^{n_{\mathrm{nu}}} \log r(x_i^{\mathrm{nu}}). \tag{31}$$

Let us further impose that the ratio model $r(x)$ is non-negative for all $x$ and is normalized
with respect to $\{x_j^{\mathrm{de}}\}_{j=1}^{n_{\mathrm{de}}}$:

$$\frac{1}{n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} r(x_j^{\mathrm{de}}) = 1.$$

Then the optimization criterion is reduced as follows:

$$\max_{r} \frac{1}{n_{\mathrm{nu}}} \sum_{i=1}^{n_{\mathrm{nu}}} \log r(x_i^{\mathrm{nu}}) \quad \text{s.t.} \quad \frac{1}{n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} r(x_j^{\mathrm{de}}) = 1 \ \text{and} \ r(x) \ge 0 \ \text{for all } x.$$

This agrees with the KLIEP formulation reviewed in Section 2.3.1.
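One way to see why the normalization constraint fits naturally with the UKL criterion: for any positive model $r$, rescaling it as $c\,r$ changes $\widehat{\mathrm{UKL}}$ to $c \cdot \mathrm{mean}_{\mathrm{de}}(r) - \log c - \mathrm{mean}_{\mathrm{nu}}(\log r)$, which is minimized at $c = 1/\mathrm{mean}_{\mathrm{de}}(r)$, i.e., exactly at the normalized model. The sketch below checks this numerically; it is an illustration added here rather than part of the original derivation, and the model values and function names are hypothetical.

```python
import numpy as np

def ukl_empirical(r_de, r_nu):
    """Eq.(31): mean_de[r] - mean_nu[log r]."""
    return np.mean(r_de) - np.mean(np.log(r_nu))

def best_scale(r_de):
    """Scale c that minimizes ukl_empirical(c * r_de, c * r_nu) over c > 0."""
    return 1.0 / np.mean(r_de)

rng = np.random.default_rng(0)
r_de = rng.uniform(0.2, 3.0, size=50)   # stand-in for model values on denominator samples
r_nu = rng.uniform(0.2, 3.0, size=40)   # stand-in for model values on numerator samples

c = best_scale(r_de)
print(np.mean(c * r_de))                        # the rescaled model satisfies the constraint
print(ukl_empirical(c * r_de, c * r_nu))        # equals 1 - mean_nu[log(c r)]
```

Under the constraint, the first term of Eq.(31) is fixed at 1, so minimizing $\widehat{\mathrm{UKL}}$ is the same as maximizing the average log-ratio over numerator samples.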

3.3  Interpretation of Density-Ratio Matching

Here, we show the correspondence between the density-ratio matching approach and a
divergence estimation method, and the correspondence between the density-ratio matching
approach and a moment-matching approach.

3.3.1  Divergence  Estimation  View 

We first show that our density-ratio matching formulation can be interpreted as
divergence estimation based on the Ali-Silvey-Csiszar (ASC) divergence (Ali and Silvey, 1966;
Csiszar, 1967), which is also known as the $f$-divergence.

Let us consider the ASC divergence for measuring the discrepancy between two
probability density functions. An ASC divergence is defined using a convex function $f$ such
that $f(1) = 0$ as follows:

$$\mathrm{ASC}_f(p^*_{\mathrm{nu}} \| p^*_{\mathrm{de}}) := \int p^*_{\mathrm{de}}(x) f\!\left( \frac{p^*_{\mathrm{nu}}(x)}{p^*_{\mathrm{de}}(x)} \right) dx. \tag{32}$$

The ASC divergence is reduced to the Kullback-Leibler (KL) divergence (Kullback and
Leibler, 1951) if $f(t) = t \log t$, and the Pearson (PE) divergence (Pearson, 1900) if
$f(t) = \frac{1}{2}(t - 1)^2$.

Let $\partial f(t)$ be the sub-differential of $f$ at a point $t$ ($\in \mathbb{R}$), which is a set defined as
follows (Rockafellar, 1970):

$$\partial f(t) := \{ z \in \mathbb{R} \mid f(s) \ge f(t) + z(s - t), \ \forall s \in \mathbb{R} \}.$$

If $f$ is differentiable at $t$, then the sub-differential is reduced to the ordinary derivative.
Although the sub-differential is a set in general, for simplicity, we treat $\partial f(t)$ as a single
element if there is no confusion. Below, we assume that $f$ is closed, i.e., its epigraph is a
closed set (Rockafellar, 1970).

Let $f^*$ be the conjugate dual function associated with $f$ defined as

$$f^*(u) := \sup_{t} [tu - f(t)] = - \inf_{t} [f(t) - tu].$$

Since $f$ is a closed convex function, we also have

$$f(t) = - \inf_{u} [f^*(u) - tu]. \tag{33}$$

For the KL divergence where $f(t) = t \log t$, the conjugate dual function is given by
$f^*(u) = \exp(u - 1)$. For the PE divergence where $f(t) = (t - 1)^2/2$, the conjugate dual
function is given by $f^*(u) = u^2/2 + u$.
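The two conjugate dual functions quoted above can be sanity-checked by brute force. The sketch below approximates $\sup_t [tu - f(t)]$ by a maximum over a fine grid of $t > 0$ and compares the result with the closed forms. The grid range and the values of $u$ are assumptions chosen so that the maximizer ($t = e^{u-1}$ for KL, $t = 1 + u$ for PE) lies inside the grid; function names are hypothetical.

```python
import numpy as np

def conjugate_on_grid(f, u, t_grid):
    """Approximate f*(u) = sup_t [t u - f(t)] by a maximum over a finite grid."""
    return np.max(t_grid * u - f(t_grid))

def f_kl(t):
    return t * np.log(t)

def f_pe(t):
    return 0.5 * (t - 1.0) ** 2

t_grid = np.linspace(1e-6, 20.0, 200001)   # positive grid with spacing about 1e-4
u_values = np.array([-0.5, 0.0, 0.5, 1.0, 2.0])

kl_numeric = np.array([conjugate_on_grid(f_kl, u, t_grid) for u in u_values])
pe_numeric = np.array([conjugate_on_grid(f_pe, u, t_grid) for u in u_values])

print(np.max(np.abs(kl_numeric - np.exp(u_values - 1.0))))            # small
print(np.max(np.abs(pe_numeric - (0.5 * u_values ** 2 + u_values))))  # small
```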

Substituting Eq.(33) into Eq.(32), we have the following lower bound (Keziou, 2003):

$$\mathrm{ASC}_f(p^*_{\mathrm{nu}} \| p^*_{\mathrm{de}}) = - \inf_{g} \mathrm{ASC}'_f(g),$$

where

$$\mathrm{ASC}'_f(g) := \int f^*(g(x)) p^*_{\mathrm{de}}(x) dx - \int g(x) p^*_{\mathrm{nu}}(x) dx. \tag{34}$$

By taking the derivative of the integrand for each $x$ and equating it to zero, we can show
that the infimum of $\mathrm{ASC}'_f$ is attained at $g$ such that

$$\partial f^*(g(x)) = \frac{p^*_{\mathrm{nu}}(x)}{p^*_{\mathrm{de}}(x)} = r^*(x).$$

Thus, minimizing $\mathrm{ASC}'_f(g)$ yields the true density-ratio function $r^*(x)$.

For some $g$, there exists $r$ such that

$$g = \partial f(r).$$

Then $f^*(g)$ is expressed as

$$f^*(g) = \sup_{s} \big[ s \, \partial f(r) - f(s) \big].$$

According to the variational principle (Jordan et al., 1999), the supremum in the
right-hand side of the above equation is attained at $s = r$. Thus, we have

$$f^*(g) = r \, \partial f(r) - f(r).$$

Then the lower bound $\mathrm{ASC}'_f(g)$ defined by Eq.(34) can be expressed as

$$\mathrm{ASC}'_f(g) = \int p^*_{\mathrm{de}}(x) \Big( r(x) \partial f(r(x)) - f(r(x)) \Big) dx - \int \partial f(r(x)) p^*_{\mathrm{nu}}(x) dx.$$

This is equivalent to the criterion $\mathrm{BR}_f$ defined by Eq.(25). Thus, density-ratio matching
under the BR divergence can be interpreted as divergence estimation under the ASC
divergence.

3.3.2  Moment Matching View

Next,  we  investigate  the  correspondence  between  the  density-ratio  matching  approach 
and  a  moment-matching  approach.  To  this  end,  we  focus  on  the  ideal  situation  where  the 
true  density-ratio  function  r*  is  included  in  the  density-ratio  model  r. 

The non-linear version of finite-order moment matching (see Section 2.1.1) learns the
density-ratio model $r$ so that the following criterion is minimized:

$$\left\| \int \phi(x) r(x) p^*_{\mathrm{de}}(x) dx - \int \phi(x) p^*_{\mathrm{nu}}(x) dx \right\|^2,$$

where $\phi(x): \mathbb{R}^d \to \mathbb{R}^m$ is some vector-valued function. Under the assumption that
the density-ratio model $r$ can represent the true density-ratio $r^*$, we have the following
estimation equation:

$$\int \phi(x) r(x) p^*_{\mathrm{de}}(x) dx - \int \phi(x) p^*_{\mathrm{nu}}(x) dx = 0_m, \tag{35}$$

where $0_m$ denotes the $m$-dimensional vector with all zeros.

On the other hand, the density-ratio matching approach described in Section 3.1 learns
the density-ratio model $r$ so that the following criterion is minimized:

$$\int p^*_{\mathrm{de}}(x) \partial f(r(x)) r(x) dx - \int p^*_{\mathrm{de}}(x) f(r(x)) dx - \int p^*_{\mathrm{nu}}(x) \partial f(r(x)) dx.$$

Taking the derivative of the above criterion with respect to parameters in the density-ratio
model $r$ and equating it to zero, we have the following estimation equation:

$$\int p^*_{\mathrm{de}}(x) r(x) \nabla r(x) \partial^2 f(r(x)) dx - \int p^*_{\mathrm{nu}}(x) \nabla r(x) \partial^2 f(r(x)) dx = 0_b,$$

where $\nabla$ denotes the differential operator with respect to parameters in the density-ratio
model $r$, and $b$ is the number of parameters. This implies that putting

$$\phi(x) = \nabla r(x) \partial^2 f(r(x))$$

in Eq.(35) gives the same estimation equation as density-ratio matching, resulting in the
same optimal solution.

3.4  Basu’s Power Divergence for Robust Density-Ratio Estimation

Finally, we introduce a new instance of density-ratio matching based on Basu’s power
divergence (BA divergence; Basu et al., 1998).

3.4.1  Derivation

For $\alpha > 0$, let

$$f(t) = \frac{t^{1+\alpha} - t}{\alpha}.$$

Then BR (24) is reduced to the BA divergence:

$$\mathrm{BA}'_\alpha(t^* \| t) := t^\alpha (t - t^*) - \frac{t^* \big( t^\alpha - (t^*)^\alpha \big)}{\alpha}.$$


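Following Eqs.(25) and (26), let us denote $\mathrm{BA}_\alpha$ without an irrelevant constant term by
$\mathrm{BA}_\alpha(r)$ and its empirical approximation by $\widehat{\mathrm{BA}}_\alpha(r)$, respectively:

$$\mathrm{BA}_\alpha(r) := \int p^*_{\mathrm{de}}(x) r(x)^{\alpha+1} dx - \left( 1 + \frac{1}{\alpha} \right) \int p^*_{\mathrm{nu}}(x) r(x)^{\alpha} dx + \frac{1}{\alpha},$$

$$\widehat{\mathrm{BA}}_\alpha(r) := \frac{1}{n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} r(x_j^{\mathrm{de}})^{\alpha+1} - \left( 1 + \frac{1}{\alpha} \right) \frac{1}{n_{\mathrm{nu}}} \sum_{i=1}^{n_{\mathrm{nu}}} r(x_i^{\mathrm{nu}})^{\alpha} + \frac{1}{\alpha}.$$

The density-ratio model $r$ is determined so that $\widehat{\mathrm{BA}}_\alpha(r)$ is minimized.

When $\alpha = 1$, the BA divergence is reduced to twice the SQ divergence (see Section 2.4):

$$\mathrm{BA}_1 = 2 \, \mathrm{SQ}.$$

Similarly, the fact

$$\lim_{\alpha \to 0} \frac{t^\alpha - 1}{\alpha} = \log t$$

implies that the BA divergence tends to the UKL divergence as $\alpha \to 0$ (see Section 3.2.4):

$$\lim_{\alpha \to 0} \widehat{\mathrm{BA}}_\alpha(r) = \frac{1}{n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} r(x_j^{\mathrm{de}}) - \frac{1}{n_{\mathrm{nu}}} \sum_{i=1}^{n_{\mathrm{nu}}} \log r(x_i^{\mathrm{nu}}) = \widehat{\mathrm{UKL}}(r).$$

Thus, the BA divergence essentially includes the SQ and UKL divergences as special cases,
and is substantially more general.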
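The two special cases can be illustrated at the level of the pointwise divergences. The sketch below evaluates $\mathrm{BA}'_\alpha(t^* \| t)$ and compares it with $2\,\mathrm{SQ}'(t^* \| t) = (t^* - t)^2$ at $\alpha = 1$ and with $\mathrm{UKL}'(t^* \| t)$ for a very small $\alpha$. The evaluation points and the value $\alpha = 10^{-6}$ used to mimic the limit are arbitrary choices for this example.

```python
import numpy as np

def ba_pointwise(t_star, t, alpha):
    """Pointwise BA divergence: t^a (t - t*) - t* (t^a - t*^a) / a."""
    return t ** alpha * (t - t_star) - t_star * (t ** alpha - t_star ** alpha) / alpha

def sq_pointwise(t_star, t):
    return 0.5 * (t_star - t) ** 2

def ukl_pointwise(t_star, t):
    return t_star * np.log(t_star / t) - t_star + t

t_star = np.array([0.5, 1.0, 2.0, 3.0])
t = np.array([1.5, 0.7, 0.5, 3.5])

print(ba_pointwise(t_star, t, 1.0) - 2.0 * sq_pointwise(t_star, t))   # zeros up to rounding
print(ba_pointwise(t_star, t, 1e-6) - ukl_pointwise(t_star, t))       # small, of order alpha
```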

3.4.2  Robustness 

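Let us take the derivative of $\widehat{\mathrm{BA}}_\alpha(r)$ with respect to parameters included in the
density-ratio model $r$, and equate it to zero. Then we have the following estimation equation:

$$\frac{1}{n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} r(x_j^{\mathrm{de}})^{\alpha} \nabla r(x_j^{\mathrm{de}}) - \frac{1}{n_{\mathrm{nu}}} \sum_{i=1}^{n_{\mathrm{nu}}} r(x_i^{\mathrm{nu}})^{\alpha - 1} \nabla r(x_i^{\mathrm{nu}}) = 0_b, \tag{36}$$

where $\nabla$ is the differential operator with respect to parameters in the density-ratio model
$r$, $b$ denotes the number of parameters, and $0_b$ denotes the $b$-dimensional vector with all
zeros.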

As explained in Section 3.4.1, the BA method with $\alpha \to 0$ corresponds to KLIEP
(using the UKL divergence). According to Eq.(31), the estimation equation of KLIEP is
given as follows (this also agrees with Eq.(36) with $\alpha = 0$):

$$\frac{1}{n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} \nabla r(x_j^{\mathrm{de}}) - \frac{1}{n_{\mathrm{nu}}} \sum_{i=1}^{n_{\mathrm{nu}}} r(x_i^{\mathrm{nu}})^{-1} \nabla r(x_i^{\mathrm{nu}}) = 0_b.$$

Comparing this with Eq.(36), we see that the BA method can be regarded as a weighted
version of KLIEP according to $r(x_j^{\mathrm{de}})^\alpha$ and $r(x_i^{\mathrm{nu}})^\alpha$. When $r(x_j^{\mathrm{de}})$ and $r(x_i^{\mathrm{nu}})$ are less than
1, the BA method down-weights the effect of those samples. Thus, ‘outlying’ samples
relative to the density-ratio model $r$ tend to have less influence on parameter estimation,
which will lead to robust estimators (Basu et al., 1998).

Since LSIF corresponds to $\alpha = 1$, LSIF is more robust against outliers than KLIEP
(which corresponds to $\alpha \to 0$) in the above sense, and BA with $\alpha > 1$ would be even more
robust.
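The weighting interpretation can be checked numerically for a linear-in-parameter model $r(x) = \theta^\top \phi(x)$, for which $\nabla r(x) = \phi(x)$. The sketch below compares a finite-difference gradient of $\widehat{\mathrm{BA}}_\alpha$ with $(\alpha + 1)$ times the left-hand side of the estimation equation, in which each denominator sample is weighted by $r^\alpha$ and each numerator sample by $r^{\alpha-1} = r^\alpha \cdot r^{-1}$. The basis values, parameter vector, and function names are hypothetical.

```python
import numpy as np

def ba_empirical(theta, Phi_de, Phi_nu, alpha):
    """Empirical BA criterion for the linear model r = Phi @ theta."""
    r_de = Phi_de @ theta
    r_nu = Phi_nu @ theta
    return (np.mean(r_de ** (alpha + 1.0))
            - (1.0 + 1.0 / alpha) * np.mean(r_nu ** alpha)
            + 1.0 / alpha)

def estimating_function(theta, Phi_de, Phi_nu, alpha):
    """mean_de[r^alpha * grad r] - mean_nu[r^(alpha-1) * grad r], with grad r = phi(x)."""
    r_de = Phi_de @ theta
    r_nu = Phi_nu @ theta
    return (Phi_de.T @ (r_de ** alpha) / len(r_de)
            - Phi_nu.T @ (r_nu ** (alpha - 1.0)) / len(r_nu))

def numerical_gradient(fun, theta, h=1e-6):
    """Central finite-difference gradient of a scalar function."""
    grad = np.zeros_like(theta)
    for k in range(len(theta)):
        e = np.zeros_like(theta)
        e[k] = h
        grad[k] = (fun(theta + e) - fun(theta - e)) / (2.0 * h)
    return grad

rng = np.random.default_rng(0)
Phi_de = rng.uniform(0.1, 1.0, size=(30, 4))   # positive basis values keep r > 0
Phi_nu = rng.uniform(0.1, 1.0, size=(20, 4))
theta = rng.uniform(0.5, 1.5, size=4)

for alpha in (0.5, 1.0, 2.0):
    g_num = numerical_gradient(lambda th: ba_empirical(th, Phi_de, Phi_nu, alpha), theta)
    g_eq = (alpha + 1.0) * estimating_function(theta, Phi_de, Phi_nu, alpha)
    print(alpha, np.max(np.abs(g_num - g_eq)))
```

The factor $(\alpha + 1)$ is positive and therefore does not affect where the gradient vanishes.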


3.4.3  Numerical  Examples 

Here  we  illustrate  the  behavior  of  the  robust  density-ratio  estimation  method  based  on 
the  BA  divergence  using  artificial  data  sets. 

Let the numerator and denominator densities be defined as follows (Figure 2(a)):

$$p^*_{\mathrm{nu}}(x) = 0.7 N(x; 0, 0.25^2) + 0.3 N(x; 1, 0.5^2) \quad \text{and} \quad p^*_{\mathrm{de}}(x) = N(x; 0, 1^2),$$

where $N(x; \mu, \sigma^2)$ denotes the Gaussian density with mean $\mu$ and variance $\sigma^2$. We draw
$n_{\mathrm{nu}} = n_{\mathrm{de}} = 300$ samples from each density, which are illustrated in Figure 2(b).

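Here, we employ the Gaussian-kernel density-ratio model (17), and determine the
model parameters so that $\widehat{\mathrm{BA}}_\alpha(r)$ with a quadratic regularizer is minimized under the
non-negativity constraint:

$$\min_{\theta \in \mathbb{R}^b} \left[ \frac{1}{n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}} \left( \sum_{\ell=1}^{n_{\mathrm{nu}}} \theta_\ell K(x_j^{\mathrm{de}}, x_\ell^{\mathrm{nu}}) \right)^{\alpha+1} - \left( 1 + \frac{1}{\alpha} \right) \frac{1}{n_{\mathrm{nu}}} \sum_{i=1}^{n_{\mathrm{nu}}} \left( \sum_{\ell=1}^{n_{\mathrm{nu}}} \theta_\ell K(x_i^{\mathrm{nu}}, x_\ell^{\mathrm{nu}}) \right)^{\alpha} + \lambda \theta^\top \theta \right] \quad \text{s.t.} \quad \theta \ge 0_b. \tag{37}$$
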

Note that this optimization problem is convex for $0 < \alpha \le 1$. In our implementation,
we solve the above optimization problem by gradient-projection, i.e., the parameters are
iteratively updated by gradient descent with respect to the objective function, and the
solution is projected back to the feasible region by rounding up negative parameters to
zero. Before solving the optimization problem (37), we run uLSIF (see Section 2.4.3)
and obtain cross-validation estimates of the Gaussian width $\sigma$ and the regularization
parameter $\lambda$. We then fix the Gaussian width and the regularization parameter in the BA
method to these values, and solve the optimization problem (37) by gradient-projection
with $\theta = 1_b / b$ as the initial solution.

Figure 2: Numerical examples. (a) Numerator and denominator density functions.
(b) Numerator and denominator sample points. (c) True and estimated density-ratio functions.
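The gradient-projection procedure described in this subsection can be sketched as follows. This is a minimal illustration, not the authors' implementation: the kernel form $\exp(-(x-x')^2/(2\sigma^2))$, the fixed values of `sigma` and `lam` (which the text obtains from uLSIF cross-validation), the backtracking step-size rule, and all function names are assumptions made for this example.

```python
import numpy as np

def gauss_kernel(a, b, sigma):
    """Gaussian kernel matrix between 1-D point sets a and b (assumed form)."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * sigma ** 2))

def objective(theta, K_de, K_nu, alpha, lam):
    """Empirical BA criterion (constant 1/alpha dropped) plus quadratic regularizer."""
    r_de = K_de @ theta
    r_nu = K_nu @ theta
    return (np.mean(r_de ** (alpha + 1.0))
            - (1.0 + 1.0 / alpha) * np.mean(r_nu ** alpha)
            + lam * theta @ theta)

def gradient(theta, K_de, K_nu, alpha, lam, eps=1e-12):
    r_de = K_de @ theta
    r_nu = np.maximum(K_nu @ theta, eps)   # guard against 0 ** negative power
    g = (alpha + 1.0) * (K_de.T @ r_de ** alpha / len(r_de)
                         - K_nu.T @ r_nu ** (alpha - 1.0) / len(r_nu))
    return g + 2.0 * lam * theta

def fit_ba(x_nu, x_de, alpha=1.0, sigma=0.3, lam=0.1, n_iter=200, step=1.0):
    """Projected gradient descent with kernel centers at the numerator samples."""
    K_de = gauss_kernel(x_de, x_nu, sigma)
    K_nu = gauss_kernel(x_nu, x_nu, sigma)
    b = len(x_nu)
    theta = np.ones(b) / b                 # initial solution 1_b / b
    obj = objective(theta, K_de, K_nu, alpha, lam)
    history = [obj]
    for _ in range(n_iter):
        g = gradient(theta, K_de, K_nu, alpha, lam)
        t = step
        while True:
            cand = np.maximum(theta - t * g, 0.0)   # project onto theta >= 0
            cand_obj = objective(cand, K_de, K_nu, alpha, lam)
            if cand_obj <= obj or t < 1e-10:
                break
            t *= 0.5                                # backtrack (assumed rule)
        if cand_obj > obj:
            break                                   # no decrease found; stop
        theta, obj = cand, cand_obj
        history.append(obj)
    return theta, history

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 300
    from_second = rng.random(n) < 0.3
    x_nu = np.where(from_second, rng.normal(1.0, 0.5, n), rng.normal(0.0, 0.25, n))
    x_de = rng.normal(0.0, 1.0, n)
    theta, history = fit_ba(x_nu, x_de, alpha=1.0)
    print(history[0], history[-1])
```

The backtracking loop is one simple way to keep the objective non-increasing; a fixed small step size would also fit the description in the text.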

Figure 2(c) shows the true and estimated density-ratio functions by the BA methods
for $\alpha = 0, 1, 2, 3$. The true density-ratio function has two peaks: a higher one at $x = 0$ and
a lower one at around $x = 1.2$. The graph shows that, as $\alpha$ increases, estimated
density-ratio functions tend to focus on approximating the higher peak and ignore the lower peak.
Thus, if numerator samples drawn from the right-hand Gaussian (i.e., $N(x; 1, 0.5^2)$) are
regarded as outliers, the BA methods with larger $\alpha$ are more robust against these outliers.

We further investigate the issue of robustness against outliers more systematically. Let

$$p^*_{\mathrm{nu}}(x) = (1 - \rho) N(x; 0, 0.25^2) + \rho N(x; 1, 0.5^2),$$

$$p^*_{\mathrm{de}}(x) = (1 - \eta) N(x; 0, 1^2) + \eta N(x; 0, 0.5^2),$$

where $\rho$ and $\eta$ are the numerator and denominator outlier ratios, respectively; samples
drawn from the second densities $N(x; 1, 0.5^2)$ and $N(x; 0, 0.5^2)$ are regarded as outliers.
We set $n_{\mathrm{nu}} = n_{\mathrm{de}} = 300$ and evaluate how the accuracy of density-ratio estimation is
influenced by outliers. In the first set of experiments, we fix the denominator outlier ratio to
$\eta = 0$ (i.e., no outliers) and change the numerator outlier ratio as $\rho = 0, 0.05, 0.1, \ldots, 0.3$.
In the second set of experiments, we fix the numerator outlier ratio to $\rho = 0$ (i.e., no
outliers) and change the denominator outlier ratio as $\eta = 0, 0.05, 0.1, \ldots, 0.3$. The
approximation error of a density-ratio estimator $\widehat{r}$ is evaluated by $\mathrm{UKL}(\widehat{r})$ defined by Eq.(30),
which corresponds to the BA divergence with $\alpha \to 0$ as explained in Section 3.4.1. Here,
$\mathrm{UKL}(\widehat{r})$ is numerically approximated using 1000 samples independently taken following
$p^*_{\mathrm{nu}}(x)$ with $\rho = 0$ (i.e., no outliers) and 1000 samples independently taken following
$p^*_{\mathrm{de}}(x)$ with $\eta = 0$ (i.e., no outliers). Note that these samples are not used for obtaining
a density-ratio estimator $\widehat{r}$. For obtaining density-ratio estimators, we use the off-the-shelf
MATLAB implementations of KLIEP (which corresponds to the BA method with $\alpha \to 0$)
and uLSIF (which corresponds to the BA method with $\alpha = 1$) available from the web
(see Section 2.3 and Section 2.4). This gives a more practical setup of density-ratio
estimation.
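The data-generating process and the error measure of this experiment can be sketched as follows. The samplers follow the two mixture densities above, and the error is the sample approximation of Eq.(30) on outlier-free test data. The estimators themselves (KLIEP and uLSIF) are not reimplemented here; as a stand-in, the sketch evaluates the true outlier-free ratio and a constant ratio. All function names are hypothetical.

```python
import numpy as np

def gauss_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def sample_mixture(rng, n, mu0, sd0, mu1, sd1, ratio):
    """Draw n points from (1 - ratio) N(mu0, sd0^2) + ratio N(mu1, sd1^2)."""
    is_outlier = rng.random(n) < ratio
    return np.where(is_outlier, rng.normal(mu1, sd1, n), rng.normal(mu0, sd0, n))

def sample_numerator(rng, n, rho):
    return sample_mixture(rng, n, 0.0, 0.25, 1.0, 0.5, rho)

def sample_denominator(rng, n, eta):
    return sample_mixture(rng, n, 0.0, 1.0, 0.0, 0.5, eta)

def ukl_test_error(r_hat, rng, n_test=1000):
    """Sample approximation of UKL(r_hat), Eq.(30), on outlier-free test samples."""
    x_nu = sample_numerator(rng, n_test, 0.0)
    x_de = sample_denominator(rng, n_test, 0.0)
    return np.mean(r_hat(x_de)) - np.mean(np.log(r_hat(x_nu)))

def true_ratio(x):
    """Outlier-free density ratio N(x; 0, 0.25^2) / N(x; 0, 1^2)."""
    return gauss_pdf(x, 0.0, 0.25) / gauss_pdf(x, 0.0, 1.0)

def constant_ratio(x):
    return np.ones_like(x)

rng = np.random.default_rng(0)
x_nu_train = sample_numerator(rng, 300, rho=0.1)     # contaminated training samples
x_de_train = sample_denominator(rng, 300, eta=0.0)

print(ukl_test_error(true_ratio, rng), ukl_test_error(constant_ratio, rng))
```

A fitted estimator would be passed to `ukl_test_error` in place of `true_ratio`; a smaller value indicates a better approximation of the outlier-free ratio.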

The median and standard deviation of UKL values for KLIEP and uLSIF over 100
runs are plotted in Figure 3. Note that the standard deviation is divided by 5 in the
plots for clear visibility. The graphs show that KLIEP works better than uLSIF when the
outlier ratio is small. This is a natural consequence since KLIEP tries to minimize UKL
(see Section 3.2.4). However, as the outlier ratio increases, the approximation error of
KLIEP grows rapidly. On the other hand, the approximation error of uLSIF grows rather
mildly, showing the robustness of uLSIF against outliers. This phenomenon agrees well
with the argument in Section 3.4.2.

However, the error bars of uLSIF are much larger than those of KLIEP. This would be caused
by the fact that the ‘effective’ number of samples used in uLSIF is smaller than that of
KLIEP due to the down-weighting effect explained in Section 3.4.2. Thus, the statistical
efficiency of uLSIF would be lower than that of KLIEP, which is a common trade-off in robust
statistical methods (Huber, 1981).

Another observation from these experimental results is that numerator outliers more
strongly degrade the accuracy of KLIEP than denominator outliers.

Figure 3: The median and standard deviation of UKL values for KLIEP and uLSIF over
100 runs when the number of outlier samples is changed. For clear visibility, the standard
deviation is divided by 5 in the plots. (a) The numerator outlier ratio $\rho$ is changed while
the denominator outlier ratio is fixed to $\eta = 0$ (i.e., no outliers). (b) The denominator
outlier ratio $\eta$ is changed while the numerator outlier ratio is fixed to $\rho = 0$ (i.e., no outliers).

4  Conclusions 

In this paper, we addressed the problem of density-ratio estimation. We first provided a
comprehensive review of density-ratio estimation methods, including the moment
matching approach (Section 2.1), the probabilistic classification approach (Section 2.2), the
density matching approach (Section 2.3), and the density-ratio fitting approach (Section 2.4).
Then we proposed a novel framework of density-ratio estimation by density-ratio fitting
under the Bregman divergence (Section 3.1). We showed that our novel framework
accommodates the existing approaches reviewed above, and is substantially more general.
Within this novel framework, we developed a robust density-ratio estimation method
based on Basu’s power divergence.

The  power  divergence  method  allows  us  to  systematically  compare  the  robustness  of 
the  density  matching  approach  based  on  the  KL  divergence  (KLIEP)  and  the  density-ratio 
fitting  approach  based  on  the  Pearson  divergence  (uLSIF).  However,  the  robustness  of  the 
probabilistic  classification  approach  is  still  unknown,  which  needs  to  be  investigated  in 
our  future  work. 

Experimentally, we observed that numerator outliers tend to degrade the accuracy of
KLIEP more significantly than denominator outliers, while uLSIF is reasonably stable
in both cases. It would be interesting to investigate this experimental tendency theoretically,
together with convergence properties of the robust method.

In the power divergence method, the choice of robustness parameter $\alpha$ is an open issue.
Although there seems to be no universal way for this (Basu et al., 1998; Jones et al., 2001;
Fujisawa and Eguchi, 2008), a practical approach would be to use cross-validation over a
fixed divergence such as the squared distance.

Abstract: Mutual information (MI) is useful for detecting statistical independence between
random variables, and it has been successfully applied to solving various machine learning
problems. Recently, an alternative to MI called squared-loss MI (SMI) was introduced.
While ordinary MI is the Kullback-Leibler divergence from the joint distribution to the
product of the marginal distributions, SMI is its Pearson divergence variant. Because
both the divergences belong to the $f$-divergence family, they share similar theoretical
properties. However, a notable advantage of SMI is that it can be approximated from
data in a computationally more efficient and numerically more stable way than ordinary
MI. In this article, we review recent development in SMI approximation based on direct
density-ratio estimation and SMI-based machine learning techniques such as independence
testing, dimensionality reduction, canonical dependency analysis, independent component
analysis, object matching, clustering, and causal inference.

Keywords:  squared-loss  mutual  information;  Pearson  divergence;  density-ratio  estimation; 
independence  testing;  dimensionality  reduction;  independent  component  analysis;  object 
matching;  clustering;  causal  inference;  machine  learning 


1.  Introduction 

Mutual information (MI) [1,2] for random variables X and Y is defined as:

$$\mathrm{MI}(X, Y) := \iint p(x, y) \log \frac{p(x, y)}{p(x) p(y)} dx dy$$

where $p(x, y)$ is the joint probability density of X and Y, and $p(x)$ and $p(y)$ are the marginal probability
densities of X and Y, respectively (Precisely, $p(x, y)$, $p(x)$, and $p(y)$ are different functions and
thus should be denoted, e.g., by $p_{X,Y}(x, y)$, $p_X(x)$, and $p_Y(y)$, respectively. However, we use the
simplified notations for the sake of brevity). Statistically, MI can be regarded as the Kullback-Leibler
divergence [3] from the joint density $p(x, y)$ to the product of the marginals $p(x)p(y)$, and thus can be
regarded as a measure of statistical dependency between X and Y. Estimation of MI from data samples
has been one of the major challenges in information science and various approaches have been developed
thus far.

The  most  naive  approach  to  approximating  MI  from  data  would  be  to  use  a  non-parametric  density 
estimator  such  as  kernel  density  estimation  (KDE)  [4],  i.e.,  the  densities  p(x,  y),p(x),  and  p(y)  included 
in  MI  are  separately  estimated  from  samples,  and  the  estimated  densities  are  used  for  approximating  MI. 
However,  density  estimation  is  known  to  be  a  hard  problem  [5]  and  division  by  estimated  densities 
tends  to  magnify  the  estimation  error.  Therefore,  the  KDE-based  MI  approximator  may  not  be  reliable 
in  practice. 

Another  approach  uses  histogram-based  density  estimators  with  data-dependent  partitioning.  In  the 
context  of  estimating  the  Kullback-Leibler  divergence  [3],  histogram-based  methods  have  been  studied 
thoroughly  and  their  consistency  has  been  established  [6-8].  However,  the  rate  of  convergence  has 
not  been  elucidated  yet,  and  such  histogram-based  methods  are  strongly  influenced  by  the  curse  of 
dimensionality.  Thus,  these  methods  may  not  be  reliable  in  high-dimensional  problems. 

MI  can  be  expressed  in  terms  of  the  entropies  as: 


MI(X,  Y)  =  H(X)  +  H{Y)  -  H(X,  Y ) 


where H(X) denotes the entropy of X:

$$H(X) := - \int p(x) \log p(x) dx$$

Based  on  this  expression,  the  nearest  neighbor  distance  has  been  used  for  approximating  MI  [9].  Such  a 
nearest  neighbor  approach  was  shown  to  perform  better  than  the  naive  KDE-based  approach  [10],  given 
that  the  number  k  of  nearest  neighbors  is  chosen  appropriately — a  small  (large)  k  yields  an  estimator  with 
small  (large)  bias  and  large  (small)  variance.  However,  appropriately  determining  the  value  of  k  so  that 
the  bias-variance  trade-off  is  optimally  controlled  is  not  straightforward  in  the  context  of  MI  estimation. 
A  similar  nearest- neighbor  idea  has  been  applied  to  Kullback-Leibler  divergence  estimation  [11],  whose 
consistency  has  been  proved  for  finite  k — this  is  an  interesting  result  since  Kullback-Leibler  divergence 
estimation  is  consistent  even  when  density  estimation  is  not  consistent.  However,  the  rate  of  convergence 
seems  to  be  still  an  open  research  issue. 

Approximation  of  the  entropies  based  on  the  Edgeworth  expansion  has  also  been  explored  in 
the  context  of  MI  estimation  [12].  Such  a  method  works  well  when  the  target  density  is  close  to 
Gaussian.  However,  if  the  target  density  is  far  from  Gaussian,  the  Edgeworth  expansion  method  is 
no  longer  reliable. 

More recently, an MI approximator via direct estimation of the density ratio p(x,y)/(p(x)p(y)) has been
developed [13], which is based on a Kullback-Leibler divergence approximator via direct density-ratio
estimation [14-16]. The MI approximator is given as the solution of a convex optimization problem,
which tends to be sparse [14]. A notable advantage of this density-ratio method is that it does not
involve separate estimation of the densities p(x,y), p(x), and p(y), and it was proved to achieve the
optimal non-parametric convergence rate. However, due to the "log" operation included in MI, this
MI approximator is computationally rather expensive and susceptible to outliers [17,18].

To  cope  with  these  problems,  a  variant  of  MI  called  the  squared-loss  mutual  information  (SMI)  [19] 
has  been  explored  recently.  SMI  for  X  and  Y  is  defined  as: 

\[ \mathrm{SMI}(X, Y) := \frac{1}{2} \iint p(x) p(y) \left( \frac{p(x, y)}{p(x) p(y)} - 1 \right)^{2} \mathrm{d}x \, \mathrm{d}y \]

SMI is the Pearson divergence [20] from the joint density p(x,y) to the product of the marginals
p(x)p(y). It is always non-negative and it vanishes if and only if X and Y are statistically independent.
Note that both the Pearson divergence and the Kullback-Leibler divergence belong to the class of
Ali-Silvey-Csiszár divergences (also known as f-divergences) [21,22], meaning that they share
similar properties.
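
As a small illustration of the two properties just stated (non-negativity, and vanishing exactly under independence), the following sketch evaluates a discrete analogue of SMI in which the integrals are replaced by sums over a joint probability table. The discrete setting and the function name `smi_discrete` are assumptions made for illustration only; they are not part of the original formulation.

```python
import numpy as np

def smi_discrete(joint):
    """Discrete analogue of SMI for a joint probability table (rows: x, columns: y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal of X, shape (a, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal of Y, shape (1, b)
    prod = px * py                          # product of the marginals
    mask = prod > 0                         # cells with zero marginal mass contribute nothing
    ratio = np.ones_like(joint)
    ratio[mask] = joint[mask] / prod[mask]  # density ratio p(x,y) / (p(x) p(y))
    return 0.5 * np.sum(prod * (ratio - 1.0) ** 2)

independent = np.outer([0.3, 0.7], [0.6, 0.4])    # joint equals product of marginals
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])    # X determines Y

smi_independent = smi_discrete(independent)
smi_dependent = smi_discrete(dependent)
```

For the independent table the ratio is identically one, so the value is zero up to rounding; for the fully dependent table the value is strictly positive.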

In  a  similar  way  to  ordinary  MI,  SMI  can  be  approximated  accurately  via  direct  estimation  of  the 
density ratio [19], which is based on a Pearson divergence approximator via direct density-ratio
estimation [16,23]. This SMI approximator has various desirable properties: for example, it was proved
to  achieve  the  optimal  non-parametric  convergence  rate  [24],  its  solution  can  be  obtained  analytically 
just  by  solving  a  system  of  linear  equations,  it  has  superior  numerical  properties  [25],  and  it  is  robust 
against  outliers  [17,18]. 

In  particular,  the  property  of  the  SMI  approximator  that  an  analytic  solution  is  available  is  highly 
useful  in  machine  learning,  because  this  allows  explicit  computation  of  the  derivative  of  the  SMI 
approximator  with  respect  to  another  parameter.  For  example,  in  supervised  dimensionality  reduction, 
linear  transformation  U  for  input  x  is  optimized  so  that  the  transformed  input  Ux  has  the  highest 
dependency  on  output  y.  In  this  context,  the  derivative  of  the  SMI  estimator  between  Ux  and  y  with 
respect  to  transformation  U  can  be  exploited  for  optimizing  transformation  U.  On  the  other  hand, 
such  derivative  computation  is  not  straightforward  for  the  MI  estimator  whose  solution  is  obtained  via 
numerical  optimization. 

The  purpose  of  this  article  is  to  review  recent  development  in  SMI  approximation  based  on  direct 
density-ratio  estimation  and  SMI-based  machine  learning  techniques.  The  remainder  of  this  paper  is 
structured  as  follows.  After  reviewing  the  SMI  approximator  based  on  direct  density-ratio  estimation 
in  Section  2,  we  illustrate  in  Section  3  how  the  SMI  approximator  can  be  utilized  for  solving 
various  machine  learning  tasks  such  as:  independence  testing  [26],  feature  selection  [19,27],  feature 
extraction  [28,29],  canonical  dependency  analysis  [30],  independent  component  analysis  [31],  object 
matching  [32],  clustering  [33,34],  and  causality  learning  [35]. 

2.  Definition  and  Estimation  of  SMI 

In  this  section,  we  review  the  definition  of  SMI  and  its  approximator  based  on  direct 
density-ratio estimation.

2.1. Definition of SMI

Let us consider two random variables $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ are the domains of X and Y,
respectively. Let p(x,y) be the joint probability density of X and Y, and let p(x) and p(y) be the marginal
probability densities of X and Y, respectively. The squared-loss mutual information (SMI) [19] for X
and Y is defined as:

\[ \mathrm{SMI}(X, Y) := \frac{1}{2} \iint p(x) p(y) \left( \frac{p(x, y)}{p(x) p(y)} - 1 \right)^{2} \mathrm{d}x \, \mathrm{d}y \tag{1} \]

SMI  is  always  non-negative  and  it  takes  zero  if  and  only  if  X  and  Y  are  statistically  independent.  Hence, 
SMI  can  be  used  for  detecting  statistical  independence  between  X  and  Y . 

Below, we consider the problem of estimating SMI from paired samples $\{(x_i, y_i)\}_{i=1}^{n}$ drawn
independently from the joint distribution with density p(x,y).


2.2.  Least-Squares  Estimation  of  SMI 

Here,  we  review  the  basic  idea  and  theoretical  properties  of  the  SMI  approximator  called  least-squares 
mutual  information  (LSMI)  [19]. 

2.2.1.  SMI  Approximation  via  Direct  Density-Ratio  Estimation 


The basic idea of LSMI is to directly estimate the following density-ratio function without going
through density estimation of p(x,y), p(x), and p(y):

\[ r(x, y) := \frac{p(x, y)}{p(x) p(y)} \tag{2} \]


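Let g(x,y) be a model of the density ratio r(x,y). In LSMI, the model is learned so that the following
squared error J is minimized:

\[
\begin{aligned}
J(g) &:= \frac{1}{2} \iint \bigl( g(x, y) - r(x, y) \bigr)^{2} p(x) p(y) \, \mathrm{d}x \, \mathrm{d}y \\
&= \frac{1}{2} \iint g(x, y)^{2} p(x) p(y) \, \mathrm{d}x \, \mathrm{d}y - \iint g(x, y) p(x, y) \, \mathrm{d}x \, \mathrm{d}y + C
\end{aligned}
\tag{3}
\]

where C is a constant defined by:

\[ C := \frac{1}{2} \iint r(x, y) p(x, y) \, \mathrm{d}x \, \mathrm{d}y \]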

Since J contains the expectations over the unknown densities p(x)p(y) and p(x,y), the expectations are
approximated by empirical averages. Then the LSMI optimization problem is given as follows:

\[ \widehat{g} := \mathop{\mathrm{argmin}}_{g \in \mathcal{G}} \left[ \frac{1}{2 n^{2}} \sum_{i,j=1}^{n} g(x_i, y_j)^{2} - \frac{1}{n} \sum_{i=1}^{n} g(x_i, y_i) \right] \tag{4} \]

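where $\mathcal{G}$ is a function space from which g is searched.

Finally, the SMI approximator called LSMI is given as:

\[ \mathrm{LSMI}(\{(x_i, y_i)\}_{i=1}^{n}) := \frac{1}{2n} \sum_{i=1}^{n} \widehat{g}(x_i, y_i) - \frac{1}{2} \tag{5} \]

or

\[ \mathrm{LSMI}'(\{(x_i, y_i)\}_{i=1}^{n}) := -\frac{1}{2 n^{2}} \sum_{i,j=1}^{n} \widehat{g}(x_i, y_j)^{2} + \frac{1}{n} \sum_{i=1}^{n} \widehat{g}(x_i, y_i) - \frac{1}{2} \tag{6} \]
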

Equation  (5)  would  be  the  simplest  SMI  approximator,  while  Equation  (6)  is  suitable  for  theoretical 
analysis  because  this  corresponds  to  the  negative  of  the  objective  function  (4)  up  to  the  constant  1/2. 
These  estimators  are  derived  based  on  the  following  equivalent  expressions  of  SMI: 


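\[ \mathrm{SMI}(X, Y) = \frac{1}{2} \iint r(x, y) p(x, y) \, \mathrm{d}x \, \mathrm{d}y - \frac{1}{2} \tag{7} \]

\[ \phantom{\mathrm{SMI}(X, Y)} = -\frac{1}{2} \iint r(x, y)^{2} p(x) p(y) \, \mathrm{d}x \, \mathrm{d}y + \iint r(x, y) p(x, y) \, \mathrm{d}x \, \mathrm{d}y - \frac{1}{2} \tag{8} \]
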

Equation  (7)  is  obtained  by  expanding  the  squared  term  in  Equation  (1),  applying  Equation  (2)  to 
the  squared  density-ratio  term  once,  and  showing  that  the  cross-term  and  the  remaining  terms  are 
$-1$ and 1/2, respectively. Equivalence between Equations (7) and (8) can be confirmed by applying
Equation  (2)  to  the  first  term  in  Equation  (8)  once.  Note  that  Equation  (8)  can  also  be  obtained  via  the 
Legendre-Fenchel  duality  of  Equation  (1),  implying  that  the  optimization  problem  (4)  corresponds  to 
approximately  maximizing  the  Legendre-Fenchel  lower-bound  [15]. 
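
The expansion that leads to Equation (7) can be written out step by step. The lines below use only the definition of SMI in Equation (1), the density ratio r of Equation (2), and the fact that both p(x,y) and p(x)p(y) integrate to one:

\[
\begin{aligned}
\mathrm{SMI}(X, Y)
&= \frac{1}{2} \iint p(x) p(y) \bigl( r(x, y) - 1 \bigr)^{2} \mathrm{d}x \, \mathrm{d}y \\
&= \frac{1}{2} \iint r(x, y)^{2} p(x) p(y) \, \mathrm{d}x \, \mathrm{d}y
 - \iint r(x, y) p(x) p(y) \, \mathrm{d}x \, \mathrm{d}y
 + \frac{1}{2} \iint p(x) p(y) \, \mathrm{d}x \, \mathrm{d}y \\
&= \frac{1}{2} \iint r(x, y) p(x, y) \, \mathrm{d}x \, \mathrm{d}y - 1 + \frac{1}{2}
\end{aligned}
\]

In the last line, one factor of r in the squared term has been combined with p(x)p(y) to give p(x,y), and the cross-term equals $-\iint p(x, y) \, \mathrm{d}x \, \mathrm{d}y = -1$.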


2.2.2.  Convergence  Analysis 


Here  we  briefly  review  theoretical  convergence  properties  of  LSMI. 

First, let us consider the case where the function class $\mathcal{G}$ from which the function g is searched is a
parametric model:

\[ \mathcal{G} = \{ g_{\theta}(x, y) \mid \theta \in \Theta \subset \mathbb{R}^{b} \} \]

Suppose that the true density ratio r is contained in the model $\mathcal{G}$, i.e., there exists $\theta^{*} \in \Theta$ such that
$r = g_{\theta^{*}}$. Then, it was shown [28] that, under the standard regularity conditions for consistency (see,
for example, Section 3.2.1 of [36]), it holds that:

\[ \mathrm{LSMI}'(\{(x_i, y_i)\}_{i=1}^{n}) - \mathrm{SMI}(X, Y) = O_p(n^{-1/2}) \]

where $O_p$ denotes the asymptotic order in probability. This shows that LSMI' retains optimality
in terms of the order of convergence in n, because $O_p(n^{-1/2})$ is the optimal convergence rate in the
parametric setup [37].

Next, we consider non-parametric cases where the function class $\mathcal{G}$ is a reproducing kernel Hilbert
space [38] on $\mathcal{X} \times \mathcal{Y}$. Let us consider a non-parametric version of the LSMI optimization problem:

\[ \widehat{g} := \mathop{\mathrm{argmin}}_{g \in \mathcal{G}} \left[ \frac{1}{2 n^{2}} \sum_{i,j=1}^{n} g(x_i, y_j)^{2} - \frac{1}{n} \sum_{i=1}^{n} g(x_i, y_i) + \frac{\lambda_n}{2} \| g \|_{\mathcal{G}}^{2} \right] \]

where $\| \cdot \|_{\mathcal{G}}$ denotes the norm in the reproducing kernel Hilbert space $\mathcal{G}$. In the above optimization
problem, the regularizer $\| g \|_{\mathcal{G}}^{2}$ is included to avoid overfitting and $\lambda_n > 0$ is the regularization parameter.

Suppose that the true density-ratio function r is contained in the function space $\mathcal{G}$ and is bounded
from above. Then, it was shown [28] that, if $\lambda_n \to 0$ and $\lambda_n^{-1} = o(n^{2/(2+\gamma)})$, where $\gamma$ ($0 < \gamma < 2$)
denotes a complexity measure of the function space $\mathcal{G}$ based on the bracketing entropy (the larger the
value of $\gamma$, the more complex the function space $\mathcal{G}$; see p. 83 of [36]), it holds that:

\[ \mathrm{LSMI}'(\{(x_i, y_i)\}_{i=1}^{n}) - \mathrm{SMI}(X, Y) = O_p(\max\{\lambda_n, n^{-1/2}\}) \tag{9} \]

The conditions $\lambda_n \to 0$ and $\lambda_n^{-1} = o(n^{2/(2+\gamma)})$ roughly mean that the regularization parameter $\lambda_n$
should be sufficiently small but not too small. Equation (9) means that the convergence rate of the
non-parametric version can also be $O_p(n^{-1/2})$ for an appropriate choice of $\lambda_n$, while the non-parametric method
requires a milder model assumption. According to [15], the above convergence rate is the minimax
optimal rate under some setup. Thus, the convergence property of the above non-parametric method
would also be optimal in the same sense.


2.3. Practical Implementation of LSMI

We have seen that LSMI has desirable convergence properties. Here we review the practical
implementation of LSMI. A MATLAB® implementation of LSMI is publicly available [39].

2.3.1.  LSMI  for  Linear-in-Parameter  Models 


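Let us approximate the density ratio in Equation (2) using the following linear-in-parameter model:

\[ g_{\theta}(x, y) = \sum_{\ell=1}^{b} \theta_{\ell} \varphi_{\ell}(x, y) = \theta^{\top} \varphi(x, y) \tag{10} \]

where $\theta = (\theta_1, \ldots, \theta_b)^{\top}$ are parameters, $\varphi(x, y) = (\varphi_1(x, y), \ldots, \varphi_b(x, y))^{\top}$ are fixed basis functions,
and $\top$ denotes the transpose. Practical choices of the basis functions will be explained in Section 2.3.2.
For this model, the LSMI optimization problem with an $\ell_2$-regularizer is expressed as:

\[ \widehat{\theta} := \mathop{\mathrm{argmin}}_{\theta \in \mathbb{R}^{b}} \left[ \frac{1}{2} \theta^{\top} \widehat{H} \theta - \theta^{\top} \widehat{h} + \frac{\lambda}{2} \theta^{\top} \theta \right] \]
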

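where $\lambda > 0$ is the regularization parameter that controls the strength of regularization, $\widehat{H}$ is the $b \times b$
matrix defined by:

\[ \widehat{H} := \frac{1}{n^{2}} \sum_{i,j=1}^{n} \varphi(x_i, y_j) \varphi(x_i, y_j)^{\top} \]

and $\widehat{h}$ is the b-dimensional vector defined by:

\[ \widehat{h} := \frac{1}{n} \sum_{i=1}^{n} \varphi(x_i, y_i) \]

The solution $\widehat{\theta}$ can be analytically obtained as:

\[ \widehat{\theta} = (\widehat{H} + \lambda I_b)^{-1} \widehat{h} \tag{11} \]

where $I_b$ is the b-dimensional identity matrix. Finally, LSMI is also given analytically as:

\[ \mathrm{LSMI}(\{(x_i, y_i)\}_{i=1}^{n}) = \frac{1}{2} \widehat{h}^{\top} \widehat{\theta} - \frac{1}{2} \tag{12} \]

or

\[ \mathrm{LSMI}'(\{(x_i, y_i)\}_{i=1}^{n}) = -\frac{1}{2} \widehat{\theta}^{\top} \widehat{H} \widehat{\theta} + \widehat{h}^{\top} \widehat{\theta} - \frac{1}{2} \tag{13} \]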

Some elements of $\widehat{\theta}$ may take negative values in the above formulation, which can lead to negative
density-ratio values and negative LSMI values. Such negative values may be rounded up to zero if
necessary, although this does not happen for sufficiently large n. Another option is to explicitly impose
the non-negativity constraint $\theta_1, \ldots, \theta_b \ge 0$ on the optimization problem. However, with this modification,
the solution can no longer be obtained analytically, but only numerically using a quadratic program
solver. (In this case, if the $\ell_2$-regularizer is replaced with the $\ell_1$-regularizer, the regularization path
[40,41], i.e., the set of solutions for all different regularization parameter values, can be computed efficiently
without a quadratic program solver just by solving systems of linear equations [23].)
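
The analytic solution for the linear-in-parameter model can be sketched in a few lines of NumPy. The function name `lsmi_linear` and the input convention (an array `phi_cross` holding the basis functions evaluated at every combination of x_i and y_j) are assumptions made for this illustration; this is not the publicly available MATLAB implementation [39].

```python
import numpy as np

def lsmi_linear(phi_cross, lam):
    """Analytic LSMI for a linear-in-parameter density-ratio model.

    phi_cross has shape (n, n, b), with phi_cross[i, j] = phi(x_i, y_j).
    Returns (theta, lsmi, lsmi_prime): the ridge solution and both SMI estimates.
    """
    phi_cross = np.asarray(phi_cross, dtype=float)
    n, _, b = phi_cross.shape
    # H: average of phi phi^T over all n^2 combinations (x_i, y_j).
    H = np.einsum("ijk,ijl->kl", phi_cross, phi_cross) / n**2
    # h: average of phi over the n paired samples (x_i, y_i).
    h = phi_cross[np.arange(n), np.arange(n)].mean(axis=0)
    # Regularized linear system (H + lam * I) theta = h.
    theta = np.linalg.solve(H + lam * np.eye(b), h)
    lsmi = 0.5 * h @ theta - 0.5
    lsmi_prime = -0.5 * theta @ H @ theta + h @ theta - 0.5
    return theta, lsmi, lsmi_prime

# Toy check with a single constant basis function phi(x, y) = 1.
theta_demo, lsmi_demo, lsmi_prime_demo = lsmi_linear(np.ones((4, 4, 1)), lam=1.0)
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a numerical design choice of this sketch. When `lam` is zero and the system is solvable, the two estimates coincide, because the solution then satisfies $\widehat{H}\widehat{\theta} = \widehat{h}$.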

2.3.2.  Design  of  Basis  Functions 


The practical accuracy of LSMI depends on the choice of basis functions in the model of Equation (10).
A typical choice is a non-parametric kernel model, i.e., setting the number of basis functions to b = n and
the $\ell$-th basis function to $\varphi_{\ell}(x, y) = K(x, x_{\ell}) L(y, y_{\ell})$:

\[ g_{\theta}(x, y) = \sum_{\ell=1}^{n} \theta_{\ell} K(x, x_{\ell}) L(y, y_{\ell}) \tag{14} \]

where K(x, x') and L(y, y') are kernel functions for x and y, respectively. If n is too large, b may be
set smaller than n, and a subset of the data points $\{(x_i, y_i)\}_{i=1}^{n}$ may be chosen as kernel centers.

For a real vector $x \in \mathbb{R}^{d}$, we may practically use the Gaussian kernel for K(x, x') after element-wise
variance normalization:

\[ K(x, x') = \exp\left( -\frac{\| x - x' \|^{2}}{2 \sigma_x^{2}} \right) \]

where $\sigma_x > 0$ is the Gaussian width. When x is a non-vectorial structured object such as a string, a tree,
or a graph, we may employ a kernel function defined for such structured data [42].

In the (multi-output) regression scenario where y is a real vector, the Gaussian kernel may also be
used for L(y, y') after element-wise variance normalization:

\[ L(y, y') = \exp\left( -\frac{\| y - y' \|^{2}}{2 \sigma_y^{2}} \right) \]

where $\sigma_y > 0$ is the Gaussian width. In the multi-class classification scenario where $y \in \{1, \ldots, c\}$ and
c denotes the number of classes, we may use the delta kernel for L(y, y'):

\[ L(y, y') = \begin{cases} 1 & \text{if } y = y' \\ 0 & \text{if } y \neq y' \end{cases} \]

Note that, in the classification case with the delta kernel, the LSMI solution can be computed efficiently
in a class-wise manner [33]. In the multi-label classification scenario where $y \in \{0, 1\}^{c}$ and c denotes
the number of labels, we may use the normalized linear kernel function [43] for y:

\[ L(y, y') = \frac{(y - \bar{y})^{\top} (y' - \bar{y})}{\| y - \bar{y} \| \, \| y' - \bar{y} \|} \]

where $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ is the sample mean.
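
To make the kernel model concrete, the sketch below computes LSMI with Gaussian kernels for both x and y, using all samples as kernel centers. It exploits the product form of the basis functions, so that the double sum defining the matrix H factorizes into two Gram-matrix products. The function names, the default widths, and the default regularization value are illustrative assumptions; in practice these tuning parameters would be chosen by model selection, and the sketch assumes no feature has zero variance.

```python
import numpy as np

def gaussian_gram(a, centers, sigma):
    """Gaussian kernel matrix between the rows of a (n, d) and of centers (b, d)."""
    sq = ((a[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma**2))

def lsmi_gaussian(x, y, sigma_x=1.0, sigma_y=1.0, lam=0.1):
    """LSMI with the product-kernel model, using every sample as a kernel center."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    # Element-wise variance normalization of both variables.
    x = (x - x.mean(axis=0)) / x.std(axis=0)
    y = (y - y.mean(axis=0)) / y.std(axis=0)
    n = x.shape[0]
    K = gaussian_gram(x, x, sigma_x)   # K[i, l] = K(x_i, x_l)
    L = gaussian_gram(y, y, sigma_y)   # L[j, l] = L(y_j, y_l)
    # phi_l(x_i, y_j) = K[i, l] * L[j, l], so the sum over (i, j) factorizes.
    H = (K.T @ K) * (L.T @ L) / n**2
    h = (K * L).mean(axis=0)
    theta = np.linalg.solve(H + lam * np.eye(n), h)
    return 0.5 * h @ theta - 0.5

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_dep = x + 0.1 * rng.normal(size=200)   # strongly dependent on x
y_ind = rng.normal(size=200)             # independent of x

lsmi_dep = lsmi_gaussian(x, y_dep)
lsmi_ind = lsmi_gaussian(x, y_ind)
```

With these toy data, the estimate for the dependent pair is expected to be clearly larger than the estimate for the independent pair, which should stay near zero.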


2.3.3.  Model  Selection  by  Cross-Validation 


Most of the above kernels include tuning parameters such as the Gaussian width, and the practical
performance of LSMI depends on the choice of such kernel parameters and the regularization parameter
$\lambda$. Model selection of LSMI is possible based on cross-validation with respect to the criterion J defined
by Equation (3).

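More specifically, the sample set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ is divided into M disjoint subsets $\{\mathcal{D}_m\}_{m=1}^{M}$.
Then the LSMI solution $\widehat{g}_m(x, y)$ is obtained using $\mathcal{D} \setminus \mathcal{D}_m$ (i.e., all samples without $\mathcal{D}_m$), and its J-score
for the hold-out samples $\mathcal{D}_m$ is computed as:

\[ \widehat{J}_m^{\mathrm{CV}} := \frac{1}{2 |\mathcal{D}_m|^{2}} \sum_{x, y \in \mathcal{D}_m} \widehat{g}_m(x, y)^{2} - \frac{1}{|\mathcal{D}_m|} \sum_{(x, y) \in \mathcal{D}_m} \widehat{g}_m(x, y) \]

where $|\mathcal{D}_m|$ denotes the number of elements in the set $\mathcal{D}_m$, $\sum_{x, y \in \mathcal{D}_m}$ denotes the summation over all
combinations of x and y in $\mathcal{D}_m$ (and thus $|\mathcal{D}_m|^{2}$ terms), while $\sum_{(x, y) \in \mathcal{D}_m}$ denotes the summation over
all pairs (x, y) in $\mathcal{D}_m$ (and thus $|\mathcal{D}_m|$ terms). This procedure is repeated for m = 1, ..., M, and the
average score,

\[ \widehat{J}^{\mathrm{CV}} := \frac{1}{M} \sum_{m=1}^{M} \widehat{J}_m^{\mathrm{CV}} \]

is computed. Finally, the model (the kernel parameter and the regularization parameter in the current
setup) that minimizes $\widehat{J}^{\mathrm{CV}}$ is chosen as the most suitable one.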
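
The cross-validation procedure can be sketched as follows for the Gaussian product-kernel model. The helper names (`gram`, `holdout_score`, `cv_score`), the use of one shared width for both kernels, the random fold assignment, and the candidate grid are all assumptions chosen to keep the example short; inputs are expected as two-dimensional arrays of shape (n, d).

```python
import numpy as np

def gram(a, c, sigma):
    """Gaussian kernel matrix between the rows of a and the rows of c."""
    sq = ((a[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma**2))

def holdout_score(K_te, L_te, theta):
    """Hold-out J-score of g(x, y) = sum_l theta_l K(x, x_l) L(y, y_l)."""
    g_all = (K_te * theta) @ L_te.T     # g_all[i, j] = g(x_i, y_j), all combinations
    g_pair = np.diag(g_all)             # g(x_i, y_i), held-out pairs only
    return 0.5 * np.mean(g_all**2) - np.mean(g_pair)

def cv_score(x, y, sigma, lam, n_folds=5, seed=0):
    """Average hold-out J-score over n_folds folds for one (sigma, lam) candidate."""
    n = x.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    scores = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        xt, yt = x[train_idx], y[train_idx]
        m = len(train_idx)
        K, L = gram(xt, xt, sigma), gram(yt, yt, sigma)
        H = (K.T @ K) * (L.T @ L) / m**2
        h = (K * L).mean(axis=0)
        theta = np.linalg.solve(H + lam * np.eye(m), h)
        scores.append(holdout_score(gram(x[test_idx], xt, sigma),
                                    gram(y[test_idx], yt, sigma), theta))
    return float(np.mean(scores))

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 1))
y = x + 0.3 * rng.normal(size=(100, 1))
grid = [(s, l) for s in (0.3, 1.0, 3.0) for l in (0.01, 0.1, 1.0)]
best_sigma, best_lam = min(grid, key=lambda p: cv_score(x, y, p[0], p[1]))
```

The pair `(best_sigma, best_lam)` is the candidate with the smallest average hold-out score; a finer or data-dependent grid would normally be used.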


3.  SMI-Based  Machine  Learning 

In this section, we show how the SMI estimator, LSMI, can be used for solving various machine
learning tasks.

3.1. Independence Testing

First,  we  show  how  the  SMI  estimator  can  be  used  for  independence  testing. 

3.1.1.  Introduction 

Identifying  the  statistical  independence  between  random  variables  is  one  of  the  fundamental 
challenges  in  statistical  data  analysis.  A  traditional  independence  measure  between  random  variables 
is  the  Pearson  correlation  coefficient,  which  can  be  used  for  detecting  linear  dependency.  Recently, 
kernel-based  independence  measures  have  been  studied  to  detect  non-linear  dependency.  The 
Hilbert-Schmidt  independence  criterion  (HSIC)  [44]  utilizes  cross-covariance  operators  on  universal 
reproducing  kernel  Hilbert  spaces  (RKHSs)  [45],  which  is  an  infinite-dimensional  generalization  of 
covariance  matrices.  HSIC  allows  efficient  detection  of  non-linear  dependency  by  making  use  of  the 
reproducing  property  of  RKHSs  [38].  However,  HSIC  has  a  weakness  that  its  performance  depends  on 
the  choice  of  RKHSs  and  there  is  no  theoretically  justified  way  to  determine  the  RKHS  properly  thus 
far.  In  practice,  using  the  Gaussian  RKHS  with  width  set  to  the  median  distance  between  samples  is  a 
popular  heuristic  [46],  but  this  does  not  always  work  well. 

To overcome the above limitations, an SMI-based independence test called the least-squares
independence test (LSIT) was proposed [26]. Below, we review LSIT.

3.1.2.  Independence  Testing  with  SMI 

Let $x \in \mathcal{X}$ be an input feature and $y \in \mathcal{Y}$ be an output feature, which follow a joint probability
distribution with density p(x,y). Suppose that we are given a set of independent and identically
distributed (i.i.d.) paired samples $\{(x_i, y_i)\}_{i=1}^{n}$. The objective of independence testing is to conclude
whether x and y are statistically independent or not, based on the samples $\{(x_i, y_i)\}_{i=1}^{n}$.

The SMI-based independence test, where the null hypothesis is that x and y are statistically
independent, is based on the permutation test procedure [47]. More specifically, LSMI is first run
using the original dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, and an SMI estimate, $\mathrm{LSMI}(\mathcal{D})$, is obtained. Next,
$\{y_i\}_{i=1}^{n}$ are randomly permuted and a shuffled dataset $\widetilde{\mathcal{D}} = \{(x_i, \widetilde{y}_i)\}_{i=1}^{n}$ is formed, where $\{\widetilde{y}_i\}_{i=1}^{n}$
denote the permuted samples. Then LSMI is run again using the shuffled dataset $\widetilde{\mathcal{D}}$, and an SMI estimate
$\mathrm{LSMI}(\widetilde{\mathcal{D}})$ is obtained. Note that the random permutation eliminates the dependency between x and y (if
it exists), and therefore $\mathrm{LSMI}(\widetilde{\mathcal{D}})$ would take a value close to zero. This random permutation procedure
is repeated many times, and the distribution of $\mathrm{LSMI}(\widetilde{\mathcal{D}})$ under the null hypothesis that x and y are
statistically independent is constructed. Finally, the p-value is approximated by evaluating the relative
ranking of $\mathrm{LSMI}(\mathcal{D})$ in the distribution of $\mathrm{LSMI}(\widetilde{\mathcal{D}})$.

This  procedure  is  called  the  least-squares  independence  test  (LSIT)  [26].  A  MATLAB® 
implementation  of  LSIT  is  publicly  available  [48]. 
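
The permutation procedure itself does not depend on how the dependence statistic is computed, so it can be sketched with a pluggable statistic. In the sketch below, `permutation_p_value` is a hypothetical helper, the add-one correction in the p-value is an assumption of this sketch, and the absolute correlation coefficient is used purely as a lightweight stand-in for LSMI; an LSMI routine would be passed as `statistic` to obtain an LSIT-style test.

```python
import numpy as np

def permutation_p_value(x, y, statistic, n_perm=200, seed=0):
    """Approximate p-value of statistic(x, y) under the independence null."""
    rng = np.random.default_rng(seed)
    observed = statistic(x, y)
    # Null distribution: recompute the statistic after shuffling y against x.
    null = np.array([statistic(x, y[rng.permutation(len(y))])
                     for _ in range(n_perm)])
    # Relative ranking of the observed value; the add-one terms avoid a p-value of zero.
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

def abs_corr(x, y):
    """Stand-in dependence statistic (NOT LSMI), used only to keep the demo short."""
    return abs(np.corrcoef(x, y)[0, 1])

rng = np.random.default_rng(0)
x = rng.normal(size=100)
p_dep = permutation_p_value(x, x + 0.5 * rng.normal(size=100), abs_corr)
p_ind = permutation_p_value(x, rng.normal(size=100), abs_corr)
```

A small `p_dep` leads to rejection of the independence hypothesis for the dependent pair, whereas `p_ind` carries no such evidence by construction of the data.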

3.2.  Supervised  Feature  Selection 

Next, we show how the SMI estimator can be used for supervised feature selection.

3.2.1. Introduction

The  objective  of  supervised  learning  is  to  learn  an  input-output  relation  from  input-output  paired 
samples.  However,  when  the  dimensionality  of  input  vectors  is  large,  using  all  input  elements  could  lead 
to  a  model  interpretability  problem.  Feature  selection  is  aimed  at  finding  a  subset  of  input  elements  that 
is  useful  for  predicting  output  values  [49]. 

Feature  ranking  is  a  simple  implementation  of  feature  selection  that  ranks  each  feature  according  to 
its  relevance.  In  this  feature  ranking  scenario,  SMI  between  a  single  input  variable  and  an  output  was 
shown  to  be  useful  [19].  However,  feature  ranking  does  not  take  feature  interaction  into  account,  and 
thus  it  is  not  useful  when  each  single  feature  is  not  capable  of  predicting  outputs,  but  multiple  features 
are  necessary  for  a  valid  prediction  of  outputs  (e.g.,  an  XOR  problem).  Two  criteria,  relevancy  and 
redundancy,  are  often  used  to  select  multiple  features  simultaneously:  A  feature  is  said  to  be  relevant  if 
it  can  explain  outputs,  and  features  are  said  to  be  redundant  if  they  are  similar.  Ideally,  we  want  to  find  a 
subset  of  features  that  has  high  relevance  and  low  redundancy. 

Another  important  issue  in  feature  selection  is  the  computational  cost:  Naively  selecting  multiple 
features  causes  computational  infeasibility  because  the  number  of  possible  feature  combinations  is 
exponential  with  respect  to  the  number  of  input  features.  To  cope  with  this  problem,  a  computationally 
efficient  method  to  handle  multiple  features  called  the  least  absolute  shrinkage  and  selection  operator 
(LASSO)  [50]  was  proposed.  In  LASSO,  a  predictor  consisting  of  a  weighted  sum  of  each  feature  is 
fitted to output values using the least-squares method, while the weight vector is confined in an $\ell_1$-ball.
The $\ell_1$-ball restriction actually provides a notable property that the solution is sparsified, meaning that
some  of  the  weight  parameters  become  exactly  zero.  Thus,  LASSO  automatically  removes  irrelevant 
features  from  its  predictor,  which  can  be  achieved  through  convex  optimization  in  a  computationally 
efficient  way  [51,52]. 

However,  LASSO  can  only  handle  linear  predictors  and  its  feature  selection  characteristic  explicitly 
depends  on  the  squared-loss  function  used  in  the  least-squares  method.  To  go  beyond  these  limitations, 
an SMI-based feature selection method called $\ell_1$-LSMI was proposed [27]. Below, we review $\ell_1$-LSMI.

3.2.2.  Feature  Selection  with  SMI 

The objective of feature selection is, from the input feature vector $x = (x^{(1)}, \ldots, x^{(d)})^{\top} \in \mathbb{R}^{d}$, to choose
a subset of its elements that are useful for the prediction of the output $y \in \mathcal{Y}$. Suppose that we are given n
i.i.d. paired samples $\{(x_i, y_i)\}_{i=1}^{n}$ drawn from a joint distribution with density p(x,y). Let $w_1, \ldots, w_d$
be feature weights for $x^{(1)}, \ldots, x^{(d)}$, and we learn the weights as:

\[
\max_{w_1, \ldots, w_d} \ \mathrm{LSMI}\Bigl( \bigl\{ \bigl( (w_1 x_i^{(1)}, \ldots, w_d x_i^{(d)})^{\top}, y_i \bigr) \bigr\}_{i=1}^{n} \Bigr)
\quad \text{subject to} \quad \sum_{\ell=1}^{d} w_{\ell} \le \eta \ \text{ and } \ w_1, \ldots, w_d \ge 0
\]

where $\eta > 0$ is the regularization parameter that controls the number of features. Because the sign
of feature weights is not relevant in feature selection, they are restricted to be non-negative. For
non-negative weights, $\sum_{\ell=1}^{d} w_{\ell}$ reduces to the $\ell_1$-norm of the feature weight vector $(w_1, \ldots, w_d)^{\top}$.
The features having zero weights are regarded as irrelevant in this formulation.

To compute the solution, a simple gradient ascent may be used, where the weight vector is projected
onto the positive orthant of the $\ell_1$-ball in each iteration to guarantee feasibility. This can be performed
by first projecting the weight vector onto the positive orthant by rounding up negative elements to zero,
and then projecting it onto the $\ell_1$-ball [54].

This SMI-based feature selection algorithm is called $\ell_1$-LSMI [27]. A MATLAB® implementation
of $\ell_1$-LSMI is publicly available [53].
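
The two-step projection used inside the gradient ascent can be sketched as below. The second step uses a standard sort-based Euclidean projection onto the scaled simplex; whether this coincides with the exact routine of [54] is an assumption, and the function name `project_l1_positive` is hypothetical.

```python
import numpy as np

def project_l1_positive(w, eta):
    """Map w into the feasible set {w >= 0, sum(w) <= eta} in two steps."""
    # Step 1: project onto the positive orthant by rounding negatives up to zero.
    v = np.maximum(np.asarray(w, dtype=float), 0.0)
    if v.sum() <= eta:
        return v                      # already inside the l1-ball of radius eta
    # Step 2: Euclidean projection onto the simplex {w >= 0, sum(w) = eta}.
    u = np.sort(v)[::-1]              # entries in decreasing order
    css = np.cumsum(u) - eta
    ks = np.arange(1, len(u) + 1)
    rho = ks[u - css / ks > 0][-1]    # number of entries that stay positive
    tau = css[rho - 1] / rho          # common shift applied to all entries
    return np.maximum(v - tau, 0.0)

w_projected = project_l1_positive([0.5, -1.0, 2.0], eta=1.0)
w_unchanged = project_l1_positive([0.2, 0.3], eta=1.0)
```

Weights that are driven exactly to zero by the shift correspond to features that the current iterate regards as irrelevant.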

3.3.  Supervised  Feature  Extraction 

While  feature  selection  chooses  a  subset  of  features  for  enhancing  model  interpretability,  feature 
extraction  finds  a  low-dimensional  representation  of  features  that  can  depend  on  all  features  (e.g.,  through 
linear  combination)  for  improving  the  prediction  accuracy.  Here,  we  show  how  the  SMI  estimator  can 
be  used  for  supervised  feature  extraction. 

3.3.1.  Introduction 

The  goal  of  sufficient  dimension  reduction  (SDR)  is  to  map  input  features  to  low-dimensional 
expressions  while  “sufficient”  information  for  predicting  output  values  is  maintained  [55].  Earlier 
SDR  methods  developed  in  the  statistics  community,  such  as  sliced  inverse  regression  [56],  principal 
Hessian  direction  [57],  and  sliced  average  variance  estimation  [58],  rely  on  the  ellipticity  of  the  data 
(e.g.,  Gaussian),  but  such  an  assumption  may  not  be  fulfilled  in  practice.  To  overcome  the  limitations  of 
these  approaches,  kernel  dimension  reduction  (KDR)  was  proposed  [59].  KDR  employs  a  kernel-based 
dependence  measure  that  is  distribution-free,  and  thus  KDR  is  flexible.  However,  it  lacks  systematic 
model  selection  strategies  for  kernel  and  regularization  parameters.  Furthermore,  KDR  scales  poorly  to 
massive  datasets  and  there  is  no  good  way  to  set  an  initial  solution  for  its  gradient-based  optimization.  In 
practice,  many  restarts  from  different  initial  solutions  may  be  needed  for  finding  a  good  local  optimum, 
which  makes  the  entire  procedure  even  slower  and  the  performance  of  dimension  reduction  unreliable. 

To  overcome  the  above  limitations,  an  SMI-based  SDR  method  called  least-squares  dimension 
reduction  (LSDR)  was  proposed  [28].  An  advantage  of  LSDR  is  that  its  tuning  parameters  can 
be  systematically  optimized  based  on  cross-validation.  To  further  address  the  computational  and 
initialization  issues,  a  heuristic  search  strategy  for  LSDR  called  sufficient  component  analysis  (SCA) 
was  proposed  [29].  Below,  we  review  LSDR  and  SCA. 

3.3.2.  Sufficient  Dimension  Reduction  with  SMI 

First, we formulate the problem of SDR [55]. Let $x \in \mathbb{R}^{d_x}$ be an input vector and $y \in \mathcal{Y}$ be an output.
The goal of SDR is to find a subspace of the input domain $\mathbb{R}^{d_x}$ that contains "sufficient" information about
the output y. We assume that there exists an orthogonal matrix $U^{*} \in \mathbb{R}^{d_u \times d_x}$ for $d_u \le d_x$ such that

\[ y \perp\!\!\!\perp x \mid U^{*} x \tag{15} \]

That is, given the projected feature $U^{*} x$, the (remaining) feature x is conditionally independent of the output
y and thus can be discarded without sacrificing the predictability of y. The objective of SDR is to find
such $U^{*}$ from n i.i.d. paired samples, $\{(x_i, y_i)\}_{i=1}^{n}$, drawn from a joint distribution with density p(x,y).
We assume that the projection dimensionality $d_u$ is known.

SMI can be used for characterizing the optimal projection matrix $U^{*}$ [28]. Indeed, it was shown
that the inequality

\[ \mathrm{SMI}(X, Y) \ge \mathrm{SMI}(UX, Y) \]

holds, and the equality holds if and only if Equation (15) holds. Thus, maximizing SMI(UX, Y) with
respect to U leads to $U^{*}$. In practice, the following optimization problem is solved:

\[ \max_{U \in \mathbb{R}^{d_u \times d_x}} \ \mathrm{LSMI}(\{(U x_i, y_i)\}_{i=1}^{n}) \quad \text{subject to} \quad U U^{\top} = I_{d_u} \]

This formulation is called least-squares dimension reduction (LSDR) [28].


3.3.3.  Gradient-Based  Subspace  Search 


A  simple  approach  to  solving  the  above  LSDR  optimization  problem  is  the  following 
iterative  procedure: 


•  U is updated to ascend the gradient of $\mathrm{LSMI}(\{(U x_i, y_i)\}_{i=1}^{n})$ with respect to U.

•  U is projected onto the feasible region specified by $U U^{\top} = I_{d_u}$.
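
The two-step update above can be sketched as follows. The gradient is supplied by the caller, and the projection is carried out with a QR factorization, which is used here as a numerically stable substitute for the classical Gram-Schmidt process; the function names and the fixed step size are assumptions of this sketch.

```python
import numpy as np

def orthonormalize_rows(U):
    """Return a matrix with orthonormal rows spanning the same row space as U."""
    Q, _ = np.linalg.qr(U.T)      # reduced QR: the columns of Q are orthonormal
    return Q.T

def ascent_step(U, grad, step):
    """One update: move along the supplied gradient, then restore U U^T = I."""
    return orthonormalize_rows(U + step * grad)

rng = np.random.default_rng(0)
U_init = orthonormalize_rows(rng.normal(size=(2, 5)))   # d_u = 2, d_x = 5
grad_demo = rng.normal(size=(2, 5))                     # placeholder for dLSMI/dU
U_next = ascent_step(U_init, grad_demo, step=0.1)
```

In an actual run, `grad_demo` would be replaced by the gradient of LSMI with respect to U, and the step size would be chosen by a line search.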


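The gradient of $\mathrm{LSMI}(\{(U x_i, y_i)\}_{i=1}^{n})$ with respect to U is given by:

\[ \frac{\partial \mathrm{LSMI}}{\partial U} = \sum_{\ell=1}^{n} \widehat{\theta}_{\ell} \frac{\partial \widehat{h}_{\ell}}{\partial U} - \frac{1}{2} \sum_{\ell, \ell'=1}^{n} \widehat{\theta}_{\ell} \widehat{\theta}_{\ell'} \frac{\partial \widehat{H}_{\ell, \ell'}}{\partial U} \]

If the kernel model of Equation (14) with the Gaussian kernel,

\[ K(Ux, Ux') = \exp\left( -\frac{\| Ux - Ux' \|^{2}}{2 \sigma_x^{2}} \right) \]

is used, $\frac{\partial \widehat{h}_{\ell}}{\partial U}$ and $\frac{\partial \widehat{H}_{\ell, \ell'}}{\partial U}$ (for $\ell, \ell' = 1, \ldots, n$) follow from differentiating the kernel and are given by:

\[ \frac{\partial \widehat{h}_{\ell}}{\partial U} = -\frac{1}{n \sigma_x^{2}} \sum_{i=1}^{n} K(U x_i, U x_{\ell}) L(y_i, y_{\ell}) \, U (x_i - x_{\ell})(x_i - x_{\ell})^{\top} \]

\[ \frac{\partial \widehat{H}_{\ell, \ell'}}{\partial U} = -\frac{1}{n^{2} \sigma_x^{2}} \left( \sum_{j=1}^{n} L(y_j, y_{\ell}) L(y_j, y_{\ell'}) \right) \sum_{i=1}^{n} K(U x_i, U x_{\ell}) K(U x_i, U x_{\ell'}) \, U \bigl[ (x_i - x_{\ell})(x_i - x_{\ell})^{\top} + (x_i - x_{\ell'})(x_i - x_{\ell'})^{\top} \bigr] \]

The projection of U onto the feasible region specified by $U U^{\top} = I_{d_u}$ may be carried out by the
Gram-Schmidt process [60], although this is time-consuming.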

An alternative way to solve the LSDR optimization problem is to perform gradient ascent on the
Grassmann manifold [61]. In the Euclidean space, the ordinary gradient gives the steepest direction.
However, on a manifold, the natural gradient [62] gives the steepest direction. The natural gradient
$\nabla \mathrm{LSMI}(U)$ at U is given as follows [63]:

\[ \nabla \mathrm{LSMI}(U) = \frac{\partial \mathrm{LSMI}}{\partial U} - \frac{\partial \mathrm{LSMI}}{\partial U} U^{\top} U = \frac{\partial \mathrm{LSMI}}{\partial U} U_{\perp}^{\top} U_{\perp} \]

where $U_{\perp}$ is any $(d_x - d_u) \times d_x$ matrix such that $[U^{\top} \mid U_{\perp}^{\top}]$ is orthogonal. Then the geodesic from U in
the direction of the natural gradient $\nabla \mathrm{LSMI}(U)$ over the Grassmann manifold can be expressed using
$t \in \mathbb{R}$ as:

\[
U_t := \begin{bmatrix} I_{d_u} & O_{d_u \times (d_x - d_u)} \end{bmatrix}
\exp\left( t \begin{bmatrix} O_{d_u} & \dfrac{\partial \mathrm{LSMI}}{\partial U} U_{\perp}^{\top} \\[2mm] -U_{\perp} \dfrac{\partial \mathrm{LSMI}}{\partial U}^{\top} & O_{d_x - d_u} \end{bmatrix} \right)
\begin{bmatrix} U \\ U_{\perp} \end{bmatrix}
\]

where "exp" for a matrix denotes the matrix exponential, $O_{d}$ is the $d \times d$ zero matrix, and $O_{d \times d'}$ is the
$d \times d'$ zero matrix. Thus, line search along the geodesic in the natural gradient direction is equivalent to
finding the maximizer from $\{U_t \mid t \ge 0\}$. For choosing the step size of each gradient update, some approximate line search method
such as Armijo's rule [64] or backtracking line search [51] may be used.

A  MATLAB®  implementation  of  LSDR  is  publicly  available  [65]. 
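
A point on the geodesic can be computed with NumPy alone, as sketched below. The block matrix inside the exponential is real and skew-symmetric, so its exponential is evaluated here through the eigendecomposition of a Hermitian matrix rather than through a library matrix-exponential routine; this choice, the helper names, and the way the complement basis is obtained from a singular value decomposition are assumptions of the sketch. The input `U` is assumed to have orthonormal rows, and `grad` stands for the ordinary gradient of LSMI with respect to U.

```python
import numpy as np

def expm_skew(A, t):
    """exp(t * A) for a real skew-symmetric A, via the Hermitian matrix 1j * A."""
    w, V = np.linalg.eigh(1j * A)                  # 1j * A = V diag(w) V^H
    return ((V * np.exp(-1j * t * w)) @ V.conj().T).real

def geodesic_point(U, grad, t):
    """Point U_t on the Grassmann geodesic leaving U along the natural gradient."""
    d_u, d_x = U.shape
    # Rows of U_perp: an orthonormal basis of the complement of the row space of U.
    U_perp = np.linalg.svd(U)[2][d_u:]
    B = grad @ U_perp.T                            # d_u x (d_x - d_u) block
    A = np.block([[np.zeros((d_u, d_u)), B],
                  [-B.T, np.zeros((d_x - d_u, d_x - d_u))]])
    # The first d_u rows of exp(tA) [U; U_perp] give U_t.
    return (expm_skew(A, t) @ np.vstack([U, U_perp]))[:d_u]

rng = np.random.default_rng(0)
U0 = np.linalg.qr(rng.normal(size=(5, 2)))[0].T    # 2 x 5 with orthonormal rows
G0 = rng.normal(size=(2, 5))                       # placeholder for dLSMI/dU
U1 = geodesic_point(U0, G0, 0.3)
```

Because the exponential of a skew-symmetric matrix is orthogonal, every `U_t` produced this way keeps orthonormal rows, so no separate re-orthonormalization is needed during the line search.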


3.3.4.  Heuristic  Subspace  Search 


Although  the  natural  gradient  method  is  computationally  more  efficient  than  the  plain  gradient 
method,  it  is  still  time  consuming.  Moreover,  many  restarts  from  different  initial  solutions  may  be 
needed  for  finding  a  good  local  optimum.  Here,  we  introduce  a  heuristic  method  that  can  address 
these  issues  [29]. 

A key idea is to use a truncated negative quadratic function called the Epanechnikov kernel [66] as a
kernel function for Ux:

\[ K(Ux, Ux') = \max\left( 0, \ 1 - \frac{\| Ux - Ux' \|^{2}}{2 \sigma_x^{2}} \right) \]


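Let $\mathbf{1}(c)$ be the indicator function, i.e., $\mathbf{1}(c) = 1$ if c is true and zero otherwise. Then, for the above
kernel function, LSMI can be expressed as:

\[ \mathrm{LSMI} = \frac{1}{2} \mathrm{tr}\bigl( U D_U U^{\top} \bigr) - \frac{1}{2} \]

where tr(·) denotes the trace of a matrix and $D_U$ is the $d_x \times d_x$ matrix defined by:

\[ D_U := \frac{1}{n} \sum_{i=1}^{n} \sum_{\ell=1}^{n} \widehat{\theta}_{\ell}(U) \, \mathbf{1}\!\left( \frac{\| U x_i - U x_{\ell} \|^{2}}{2 \sigma_x^{2}} < 1 \right) L(y_i, y_{\ell}) \left[ \frac{1}{d_u} I_{d_x} - \frac{1}{2 \sigma_x^{2}} (x_i - x_{\ell})(x_i - x_{\ell})^{\top} \right] \]

Here, the fact that $\widehat{\theta}_{\ell}$ depends on U is explicitly indicated by $\widehat{\theta}_{\ell}(U)$.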

If $U$ in $D_U$ is replaced by $U'$, where $U'$ is a transformation matrix obtained in the previous iteration, the SMI estimator is simplified as:

$$\frac{1}{2}\,\mathrm{tr}\left(U D_{U'} U^\top\right) - \frac{1}{2} \qquad (16)$$

Because $D_{U'}$ is independent of $U$, a maximizer of Equation (16) with respect to $U$ can be analytically obtained by $(u_1 | \cdots | u_{d_u})^\top$, where $\{u_i\}_{i=1}^{d_u}$ are the $d_u$ principal components of $D_{U'}$. The same technique can also be utilized for determining an initial transformation matrix, by computing the above solution for $U' = I_{d_x}$ (i.e., no dimensionality reduction).

The above heuristic search method for LSDR is called sufficient component analysis (SCA) [29]. A MATLAB® implementation of SCA is publicly available [67].
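The analytic maximization step can be illustrated with a minimal NumPy sketch. It assumes that a symmetric matrix playing the role of $D_{U'}$ has already been computed; the function name `sca_update` and the random test matrix are illustrative assumptions and are not taken from the MATLAB® implementation [67].

```python
import numpy as np

def sca_update(D, d_u):
    """Return U (d_u x d_x) maximizing tr(U D U^T) subject to U U^T = I.

    D stands in for D_{U'}; it is assumed to be a symmetric d_x x d_x matrix.
    """
    D = 0.5 * (D + D.T)                      # remove round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(D)     # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:d_u]    # indices of the d_u largest eigenvalues
    return eigvecs[:, top].T                 # rows are the leading eigenvectors

# Illustrative data only: a random symmetric positive semi-definite matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
D = A @ A.T
U = sca_update(D, 2)
```

The returned rows are orthonormal, and the attained value of $\mathrm{tr}(U D U^\top)$ equals the sum of the $d_u$ largest eigenvalues of $D$, which is the largest value achievable under the constraint $UU^\top = I_{d_u}$.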

3.4.  Canonical  Dependency  Analysis 

Next,  we  show  how  the  SMI  estimator  can  be  used  for  feature  extraction  from  two  sets  of  data. 

3.4.1.  Introduction 

Canonical  correlation  analysis  (CCA)  [68]  is  a  classical  dimensionality  reduction  technique  for  two 
data  sources,  and  it  iteratively  finds  projection  directions  with  maximum  correlation.  However,  because 
CCA  only  captures  correlations  under  linear  projections,  it  is  often  insufficient  to  analyze  complex 
real-world  data  that  contain  higher-order  correlations.  To  be  more  flexible,  non-linear  CCA  methods  have 
been  explored.  A  simple  approach  uses  neural  networks  to  handle  non-linear  projections  [69,70],  but 
neural networks are prone to local optima. Another approach first non-linearly transforms data samples
into feature spaces and then applies linear CCA [71,72]. Given that the non-linear transformation is fixed,
this  two-step  approach  allows  analytic  computation  of  the  global  optimal  solution  via  a  generalized 
eigenvalue  problem  in  the  same  way  as  linear  CCA.  This  non-linear  approach  is  called  kernel  CCA 
(KCCA)  because  reproducing  kernels  [38]  are  used  as  non-linear  transforms.  Alternating  regression  such 
as  the  alternating  conditional  expectation  [73]  is  another  possible  way  to  find  dependency  in  a  flexible 
manner,  which  estimates  transformations  for  two  variables  alternately  by  minimizing  the  squared  error 
between  transformed  variables.  These  non-linear  variants  of  CCA  are  highly  flexible,  although  obtained 
results  are  often  difficult  to  interpret  due  to  the  non-linearity. 

The  above  non-linear  CCA  approaches  can  be  regarded  as  capturing  correlations  along  non-linear 
projection  directions.  Another  extension  of  CCA  called  canonical  dependency  analysis  (CDA)  [30] 
captures  higher-order  correlations  under  linear  projections.  It  was  shown  that  KCCA  with  a  universal 
kernel  [45]  such  as  the  Gaussian  kernel  allows  efficient  detection  of  higher-order  correlations  [74]. 
However,  the  choice  of  universal  kernels  affects  the  practical  performance,  and  there  is  no  systematic 
method  to  choose  a  suitable  kernel  function.  Another  approach  to  higher-order  CCA  called  informational 
CCA  (ICCA)  [75]  uses  mutual  information  (MI)  as  a  dependency  measure,  where  MI  is  estimated 
via  kernel  density  estimation  (KDE).  Because  systematic  model  selection  strategies  are  available  for 
KDE  [76],  ICCA  could  be  more  practical  than  the  KCCA-based  CDA  method.  In  the  ICCA  method, 
one-dimensional  projection  directions  are  found  in  an  iterative  manner.  Thus,  it  would  be  more  powerful 
if  multi-dimensional  projection  directions  (i.e.,  a  subspace)  could  be  directly  found  in  CDA  [30]. 
However,  ICCA  may  not  be  reliable  in  such  a  subspace  search  scenario  because  it  involves  the  ratio 
of  estimated  densities,  which  tends  to  produce  large  estimation  error  if  the  dimensionality  is  not  small. 

To  overcome  the  above  limitation,  an  SMI-based  CDA  method  called  least-squares  CDA  (LSCDA) 
was  proposed  [30].  Below,  we  review  LSCDA. 

3.4.2. Canonical Dependency Analysis with SMI

Suppose that we are given $n$ i.i.d. paired samples $\{(x_i, y_i) \mid x_i \in \mathbb{R}^{d_x},\ y_i \in \mathbb{R}^{d_y}\}_{i=1}^{n}$ drawn from a joint distribution with density $p(x, y)$. CDA is aimed at finding the low-dimensional expressions of $x$ and $y$ that are maximally dependent on each other. Here, we focus on linear dimension reduction, i.e., $x$ and $y$ are transformed as $Ux$ and $Vy$, where $U \in \mathbb{R}^{d_u \times d_x}$ and $V \in \mathbb{R}^{d_v \times d_y}$ are orthogonal matrices with known dimensionalities $d_u$ and $d_v$. The objective of CDA is to find the transformation matrices $U$ and $V$ such that the statistical dependency between $Ux$ and $Vy$ is maximized. Let us use the SMI estimator, $\mathrm{LSMI}(\{(Ux_i, Vy_i)\}_{i=1}^{n})$, as the dependency measure, i.e., we solve,

$$\mathop{\mathrm{argmax}}_{U \in \mathbb{R}^{d_u \times d_x},\ V \in \mathbb{R}^{d_v \times d_y}} \mathrm{LSMI}(\{(Ux_i, Vy_i)\}_{i=1}^{n})$$

subject to $UU^\top = I_{d_u}$ and $VV^\top = I_{d_v}$

This  formulation  is  called  least-squares  CDA  (LSCDA)  [30]. 

The  above  optimization  problem  can  be  solved  in  the  same  way  as  LSDR  presented  in  Section  3.3.3. 

A  MATLAB®  implementation  of  LSCDA  is  publicly  available  [77]. 

3.5.  Independent  Component  Analysis 

Here,  we  show  how  the  SMI  estimator  can  be  used  for  independent  component  analysis. 

3.5.1.  Introduction 

Suppose  that  there  exist  statistically  independent  sources  of  signals,  and  we  observe  their  mixtures. 
The  purpose  of  independent  component  analysis  (ICA)  [78]  is  to  separate  the  mixed  signals  into  the 
original  source  signals.  An  approach  to  ICA  is  to  separate  the  mixed  signals  such  that  statistical 
independence  among  separated  signals  is  maximized  under  some  independence  measure. 

Various  methods  for  evaluating  the  statistical  independence  among  random  variables  from  samples 
have  been  explored  so  far.  A  naive  approach  is  to  estimate  probability  densities  based  on  parametric  or 
non-parametric  density  estimation  methods.  However,  finding  an  appropriate  parametric  model  is  not 
straightforward  without  strong  prior  knowledge  and  non-parametric  estimation  is  not  generally  accurate 
in  high-dimensional  problems.  Thus,  this  naive  approach  is  not  reliable  in  practice.  Another  approach  is 
to  approximate  the  entropy  based  on  the  Gram-Charlier  expansion  [79]  or  the  Edgeworth  expansion  [80]. 
An  advantage  of  this  entropy-based  approach  is  that  a  hard  task  of  density  estimation  is  not  directly 
involved.  However,  these  expansion  techniques  are  based  on  the  assumption  that  the  target  density  is 
close  to  Gaussian,  and  violation  of  this  assumption  can  cause  large  approximation  error. 

The  above  approaches  are  based  on  the  probability  densities  of  signals.  Another  line  of  research  that 
does  not  explicitly  involve  probability  densities  employs  non-linear  correlation — signals  are  statistically 
independent  if  and  only  if  all  non-linear  correlations  among  signals  vanish.  Following  this  line, 
computationally  efficient  algorithms  have  been  developed  based  on  a  contrast  function  [81,82],  which  is 
an  approximation  of  the  entropy  or  mutual  information.  However,  non-linearities  in  the  contrast  function 
need  to  be  pre-specified  in  these  methods,  and  thus  they  could  be  inaccurate  if  the  predetermined 
non-linearities do not match the target distribution. To cope with this problem, the kernel trick has
been applied in ICA, which allows computationally efficient evaluation of all non-linear correlations
(Bach and Jordan, 2002). However, its practical performance depends on the choice of kernels
(more specifically, the Gaussian kernel width) and there seems to be no theoretically justified method to
determine the kernel width. This is a critical problem in unsupervised learning tasks such as ICA.

To  cope  with  this  problem,  an  SMI-based  ICA  algorithm  called  least-squares  independent  component 
analysis  (LICA)  has  been  developed  [31].  Below,  we  review  LICA. 

3.5.2.  Independent  Component  Analysis  with  SMI 

Suppose there are $d$ signal sources and let $\{x_i \mid x_i = (x_i^{(1)}, \ldots, x_i^{(d)})^\top \in \mathbb{R}^{d}\}_{i=1}^{n}$ be i.i.d. samples drawn from a distribution with density $p(x)$. We assume that the elements $x^{(1)}, \ldots, x^{(d)}$ are statistically independent of each other, i.e., $p(x)$ is factorized as:

$$p(x) = p(x^{(1)}) \cdots p(x^{(d)})$$

We cannot directly observe $\{x_i\}_{i=1}^{n}$, but only their linearly mixed samples $\{y_i\}_{i=1}^{n}$:

$$y_i := U x_i$$

where $U$ is a $d \times d$ invertible matrix called the mixing matrix.

The goal of ICA is, from the mixed samples $\{y_i\}_{i=1}^{n}$, to obtain a demixing matrix $V$ that recovers the original source samples $\{x_i\}_{i=1}^{n}$. We denote the demixed samples by $\{z_i\}_{i=1}^{n}$:

$$z_i = V y_i$$

The ideal solution is $V = U^{-1}$, but we can only recover the source signals up to permutation and scaling of components of $x$ due to non-identifiability of the ICA setup [78]. Let us write the components of the demixed samples as:

$$z_i = (z_i^{(1)}, \ldots, z_i^{(d)})^\top := V y_i \quad \text{for } i = 1, \ldots, n.$$

A  direct  approach  to  ICA  is  to  determine  V  so  that  elements  of  z  are  as  statistically  independent  as 
possible.  Here,  we  adopt  SMI  as  the  independence  measure: 

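$$\mathrm{SMI}(Z^{(1)}, \ldots, Z^{(d)}) := \frac{1}{2}\int\!\cdots\!\int \left(\frac{p(z)}{p(z^{(1)}) \cdots p(z^{(d)})} - 1\right)^{2} p(z^{(1)}) \cdots p(z^{(d)})\, \mathrm{d}z^{(1)} \cdots \mathrm{d}z^{(d)}$$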

We  try  to  find  the  demixing  matrix  V  that  minimizes  SMI.  In  practice,  the  following  optimization 
problem  is  solved: 


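$$\min_{V \in \mathbb{R}^{d \times d}} \mathrm{LSMI}(\{Vy_i\}_{i=1}^{n})$$

where $\mathrm{LSMI}(\{Vy_i\}_{i=1}^{n})$ is given by the same form as Equation (12) (or Equation (13)), but the matrix $\hat{H}$ and the vector $\hat{h}$ are defined in a slightly different way. For the Gaussian kernel,

$$K(Vy, Vy') = \exp\left(-\frac{\|Vy - Vy'\|^2}{2\sigma^2}\right)$$

$\hat{H}$ and $\hat{h}$ are given by:

$$\hat{H}_{\ell,\ell'} = \frac{1}{n^{d}} \prod_{m=1}^{d} \sum_{i=1}^{n} \exp\left(-\frac{(z_i^{(m)} - z_\ell^{(m)})^2 + (z_i^{(m)} - z_{\ell'}^{(m)})^2}{2\sigma^2}\right)$$

$$\hat{h}_\ell = \frac{1}{n} \sum_{i=1}^{n} \exp\left(-\frac{\|z_i - z_\ell\|^2}{2\sigma^2}\right)$$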

This  formulation  is  called  least-squares  independent  component  analysis  (LICA)  [31]. 
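As an illustration of how the product-of-marginals structure of $\hat{H}$ can be exploited, the following NumPy sketch computes $\hat{h}$ and $\hat{H}$ for the Gaussian kernel. It assumes that all $n$ demixed samples are used as kernel centers; the function name `lica_h_and_H` and the random data are illustrative assumptions, not part of the LICA implementation [83].

```python
import numpy as np

def lica_h_and_H(Z, sigma):
    """Compute h (n,) and H (n, n) for demixed samples Z of shape (n, d).

    Assumption: every sample also serves as a kernel center.
    """
    n, d = Z.shape
    # diff[m, i, l] = z_i^(m) - z_l^(m)
    diff = Z.T[:, :, None] - Z.T[:, None, :]
    G = np.exp(-diff ** 2 / (2.0 * sigma ** 2))   # per-dimension Gaussian factors
    # h_l = (1/n) sum_i exp(-||z_i - z_l||^2 / (2 sigma^2))
    h = G.prod(axis=0).mean(axis=0)
    # H_{l,l'} = prod_m (1/n) sum_i G[m, i, l] * G[m, i, l']
    H = np.ones((n, n))
    for m in range(d):
        H *= (G[m].T @ G[m]) / n
    return h, H

# Illustrative data only.
rng = np.random.default_rng(0)
Z = rng.standard_normal((4, 2))
sigma = 0.7
h, H = lica_h_and_H(Z, sigma)
```

Because the expectation in $\hat{H}$ is taken over the product of the marginals, it factorizes over the $d$ coordinates, so the cost is $d$ matrix products of size $n \times n$ rather than a sum over $n^{d}$ index combinations.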

3.5.3.  Gradient-Based  Demixing  Matrix  Search 

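Based on the plain gradient technique, an update rule of $V$ is given by:

$$V \longleftarrow V - t\,\frac{\partial\, \mathrm{LSMI}}{\partial V}$$

where $t$ ($> 0$) is the step size. The gradient is given by:

$$\frac{\partial\, \mathrm{LSMI}}{\partial V} = \sum_{\ell=1}^{n} \hat{\theta}_\ell \frac{\partial \hat{h}_\ell}{\partial V} - \frac{1}{2} \sum_{\ell,\ell'=1}^{n} \hat{\theta}_\ell \hat{\theta}_{\ell'} \frac{\partial \hat{H}_{\ell,\ell'}}{\partial V}$$

where

$$\frac{\partial \hat{h}_\ell}{\partial V_{k,k'}} = -\frac{1}{n\sigma^2} \sum_{i=1}^{n} (z_i^{(k)} - z_\ell^{(k)})(y_i^{(k')} - y_\ell^{(k')}) \exp\left(-\frac{\|z_i - z_\ell\|^2}{2\sigma^2}\right)$$

$$\frac{\partial \hat{H}_{\ell,\ell'}}{\partial V_{k,k'}} = \frac{1}{n^{d-1}} \left[\prod_{m \neq k} \sum_{i=1}^{n} \exp\left(-\frac{(z_i^{(m)} - z_\ell^{(m)})^2 + (z_i^{(m)} - z_{\ell'}^{(m)})^2}{2\sigma^2}\right)\right]$$
$$\qquad \times \left(-\frac{1}{n\sigma^2}\right) \sum_{i=1}^{n} \left[(z_i^{(k)} - z_\ell^{(k)})(y_i^{(k')} - y_\ell^{(k')}) + (z_i^{(k)} - z_{\ell'}^{(k)})(y_i^{(k')} - y_{\ell'}^{(k')})\right]$$
$$\qquad \times \exp\left(-\frac{(z_i^{(k)} - z_\ell^{(k)})^2 + (z_i^{(k)} - z_{\ell'}^{(k)})^2}{2\sigma^2}\right) \qquad (17)$$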


In  ICA,  scaling  of  components  of  z  can  be  arbitrary.  This  implies  that  the  above  gradient  updating 
rule  can  lead  to  a  solution  with  poor  scaling,  which  is  not  preferable  from  a  numerical  viewpoint.  To 
avoid  possible  numerical  instability,  V  is  normalized  at  each  gradient  iteration  as: 


$$V_{k,k'} \longleftarrow \frac{V_{k,k'}}{\sqrt{\sum_{m=1}^{d} V_{k,m}^{2}}}$$
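The update and normalization can be combined into a single step, as in the minimal NumPy sketch below. The gradient matrix `grad` is assumed to have been computed elsewhere (for example from Equation (17)); the function name `gradient_step` and the numerical values are illustrative assumptions.

```python
import numpy as np

def gradient_step(V, grad, t):
    """One plain-gradient step on V followed by row-wise normalization.

    grad is a placeholder for the d x d matrix of partial derivatives of LSMI.
    """
    V_new = V - t * grad
    row_norms = np.sqrt((V_new ** 2).sum(axis=1, keepdims=True))
    return V_new / row_norms   # each row now has unit Euclidean norm

# Illustrative values only.
V0 = np.array([[2.0, 0.0], [1.0, 1.0]])
grad0 = np.array([[0.5, -0.5], [0.0, 1.0]])
V1 = gradient_step(V0, grad0, t=0.1)
```

Normalizing each row fixes the scale of the corresponding demixed component $z^{(k)}$ without changing its direction, which is why it does not affect the independence of the recovered signals.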


3.5.4.  Natural  Gradient  Demixing  Matrix  Search 

Suppose that data samples are whitened, i.e., samples $\{y_i\}_{i=1}^{n}$ are pre-transformed as:

$$y_i \longleftarrow \hat{\Sigma}^{-1/2} y_i$$

where $\hat{\Sigma}$ is the sample covariance matrix:

$$\hat{\Sigma} := \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \frac{1}{n}\sum_{j=1}^{n} y_j\right) \left(y_i - \frac{1}{n}\sum_{j=1}^{n} y_j\right)^{\top}$$

Then it can be shown that any demixing matrix that eliminates the second-order correlation is an orthogonal matrix [78]. Thus, for whitened data, the search space of $V$ can be restricted to the orthogonal group without loss of generality. The natural gradient [62] update rule on the orthogonal group is given by:

$$V \longleftarrow V \exp\left(-t \left(V^\top \frac{\partial\, \mathrm{LSMI}}{\partial V} - \left(\frac{\partial\, \mathrm{LSMI}}{\partial V}\right)^{\top} V\right)\right)$$

where "exp" for a matrix denotes the matrix exponential and $t$ ($> 0$) is the step size.
A  MATLAB®  implementation  of  LICA  is  publicly  available  [83]. 
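The whitening transform and the update on the orthogonal group can be sketched in NumPy as follows. The gradient matrix is again assumed to be supplied from elsewhere, and the matrix exponential of the skew-symmetric argument is computed through a Hermitian eigendecomposition so that only NumPy is needed; the function names and the random inputs are illustrative assumptions.

```python
import numpy as np

def whiten(Y):
    """Transform samples Y (n, d) by the inverse square root of their sample covariance."""
    Yc = Y - Y.mean(axis=0)
    Sigma = Yc.T @ Yc / Y.shape[0]
    w, Q = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T   # symmetric matrix
    return Y @ Sigma_inv_sqrt          # row i equals Sigma^{-1/2} y_i

def expm_skew(A):
    """Matrix exponential of a real skew-symmetric matrix A."""
    w, Q = np.linalg.eigh(1j * A)      # 1j * A is Hermitian when A is skew-symmetric
    return (Q @ np.diag(np.exp(-1j * w)) @ Q.conj().T).real

def natural_gradient_step(V, grad, t):
    """V <- V exp(-t (V^T grad - grad^T V)); grad is a placeholder for dLSMI/dV."""
    M = V.T @ grad
    return V @ expm_skew(-t * (M - M.T))

# Illustrative data only.
rng = np.random.default_rng(0)
Y = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 3))
Yw = whiten(Y)
V = natural_gradient_step(np.eye(3), rng.standard_normal((3, 3)), t=0.1)
```

Since the exponential of a skew-symmetric matrix is orthogonal, an orthogonal $V$ stays orthogonal after every update, so no separate normalization step is required.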


3.6.  Cross-Domain  Object  Matching 

Next,  we  show  how  the  SMI  estimator  can  be  used  for  cross-domain  object  matching. 

3.6.1.  Introduction 

The  objective  of  cross-domain  object  matching  is  to  match  two  sets  of  unpaired  objects  in  different 
domains.  For  example,  in  photo  album  summarization,  we  are  given  a  set  of  photos  and  a  designed  photo 
frame  expressed  as  a  set  of  photo  slots  in  the  Cartesian  coordinate  system,  and  we  want  to  automatically 
assign the photos to the designed photo frame. A typical approach to cross-domain object matching
is  to  find  a  mapping  from  objects  in  one  domain  (photos)  to  objects  in  the  other  domain  (frame)  so  that 
the  pairwise  dependency  is  maximized.  In  this  scenario,  accurately  evaluating  the  dependence  between 
objects  is  a  key  issue. 

Kernelized  sorting  [84]  tries  to  find  the  mapping  between  two  domains  that  maximizes  mutual 
information  under  the  Gaussian  assumption.  However,  because  the  Gaussian  assumption  may  not 
be fulfilled in practice, this method tends to perform poorly. To overcome the above limitation, the
use of a kernel-based dependence measure called the Hilbert-Schmidt independence criterion (HSIC) [85] was
proposed for kernelized sorting [86]. Because HSIC is distribution-free, HSIC-based kernelized
sorting  is  more  flexible  than  the  original  method  based  on  the  Gaussian  assumption.  However,  HSIC 
includes  a  tuning  parameter  (more  specifically,  the  Gaussian  kernel  width),  and  its  choice  is  crucial  to 
obtain  better  performance  [87]. 

To  cope  with  this  problem,  an  SMI-based  cross-domain  object  matching  method  called  least-squares 
object  matching  (LSOM)  was  developed  [32].  Below,  we  review  LSOM. 

3.6.2.  Cross-Domain  Object  Matching  with  SMI 

The goal of cross-domain object matching is, given two sets of unpaired samples of the same size, $\{x_i \mid x_i \in \mathcal{X}\}_{i=1}^{n}$ and $\{y_i \mid y_i \in \mathcal{Y}\}_{i=1}^{n}$, to find a mapping that well "matches" them. Let $\pi$ be a permutation function over $\{1, \ldots, n\}$. The optimal permutation, denoted by $\pi^{*}$, can be obtained as the maximizer of the dependency between the two sets $\{x_i\}_{i=1}^{n}$ and $\{y_{\pi(i)}\}_{i=1}^{n}$. Here, we use the SMI approximator, $\mathrm{LSMI}(\{(x_i, y_{\pi(i)})\}_{i=1}^{n})$, as the dependency measure, i.e., we solve,

$$\max_{\pi}\ \mathrm{LSMI}(\{(x_i, y_{\pi(i)})\}_{i=1}^{n})$$

Let $K$ and $L$ be the $n \times n$ kernel matrices defined by $K_{i,j} = K(x_i, x_j)$ and $L_{i,j} = L(y_i, y_j)$. Then LSMI for $\{(x_i, y_{\pi(i)})\}_{i=1}^{n}$ can be expressed as:

$$\mathrm{LSMI}(\{(x_i, y_{\pi(i)})\}_{i=1}^{n}) = \frac{1}{2n}\,\mathrm{tr}\left(\Pi^\top L\, \Pi\, \hat{\Theta}_{\Pi} K\right) - \frac{1}{2} \qquad (18)$$

where $\Pi$ is the permutation matrix corresponding to $\pi$, i.e., $\Pi$ is the $n \times n$ zero-one matrix such that $\Pi_{i,j} = 1$ if $i = \pi(j)$ for $j = 1, \ldots, n$ and $\Pi_{i,j} = 0$ otherwise. $\hat{\Theta}_{\Pi}$ is the diagonal matrix with diagonal elements given by the LSMI solution $\hat{\theta}_{\Pi}$ obtained from the paired data $\{(x_i, y_{\pi(i)})\}_{i=1}^{n}$ (see Equation (11)).

Because maximizing Equation (18) with respect to $\Pi$ is computationally infeasible, a greedy update from the previous solution $\Pi'$ is used in practice:

$$\Pi^{\mathrm{new}} = (1 - t)\,\Pi' + t \mathop{\mathrm{argmax}}_{\Pi}\ \mathrm{tr}\left(\Pi^\top L\, \Pi'\, \hat{\Theta}_{\Pi'} K\right)$$

where  0  <  t  <  1  is  the  step  size.  Maximization  of  the  second  term  is  called  a  linear  assignment  problem, 
which  can  be  solved  efficiently  by  the  Hungarian  method  [88]. 

The  above  method  is  called  least-squares  object  matching  (LSOM)  [32].  A  MATLAB® 
implementation  of  LSOM  is  publicly  available  [89]. 
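The structure of Equation (18) and of the greedy update can be sketched in NumPy as follows. The coefficient vector `theta` is assumed to be given (in LSOM it would be re-estimated by LSMI for each permutation), and the linear assignment step is solved here by brute-force enumeration, which is only feasible for very small $n$ and merely stands in for the Hungarian method [88]. All function names and data are illustrative assumptions.

```python
import itertools
import numpy as np

def perm_matrix(pi):
    """Permutation matrix with Pi[i, j] = 1 if i == pi[j] (0-based indices)."""
    n = len(pi)
    Pi = np.zeros((n, n))
    Pi[pi, np.arange(n)] = 1.0
    return Pi

def lsom_objective(Pi, K, L, theta):
    """Equation (18) with Theta = diag(theta); theta is assumed to be given."""
    n = K.shape[0]
    return np.trace(Pi.T @ L @ Pi @ np.diag(theta) @ K) / (2 * n) - 0.5

def best_assignment_bruteforce(M):
    """argmax over permutation matrices of tr(Pi^T M) = sum_j M[pi[j], j]."""
    n = M.shape[0]
    best = max(itertools.permutations(range(n)),
               key=lambda p: sum(M[p[j], j] for j in range(n)))
    return perm_matrix(list(best))

def greedy_update(Pi_prev, K, L, theta, t):
    """Convex combination of the previous solution and the assignment solution."""
    M = L @ Pi_prev @ np.diag(theta) @ K
    return (1 - t) * Pi_prev + t * best_assignment_bruteforce(M)

# Illustrative data only.
rng = np.random.default_rng(0)
A = rng.random((4, 4)); K = A @ A.T
B = rng.random((4, 4)); L = B @ B.T
theta = np.ones(4)
Pi_new = greedy_update(np.eye(4), K, L, theta, t=0.5)
```

The second term is linear in $\Pi$ because $\Pi'$ is held fixed, which is what turns the step into a linear assignment problem. The updated matrix is in general doubly stochastic rather than a permutation matrix.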

3.7.  Clustering 

Here,  we  show  how  SMI  can  be  effectively  used  for  clustering. 

3.7.1.  Introduction 

The  objective  of  clustering  is  to  classify  data  samples  into  disjoint  groups  in  an  unsupervised  manner. 
K-means  [90]  is  a  classic  but  still  popular  clustering  algorithm.  However,  k-means  only  produces  linearly 
separated  clusters,  and  thus  its  usefulness  is  rather  limited  in  practice.  To  cope  with  this  problem, 
various  non-linear  clustering  methods  have  been  developed.  Kernel  k-means  [91]  performs  k-means 
in  a  feature  space  induced  by  a  reproducing  kernel  function  [46].  Spectral  clustering  [92,93]  first  unfolds 
non-linear  data  manifolds  by  a  spectral  embedding  method,  and  then  performs  k-means  in  the  embedded 
space. Blurring mean-shift [94,95] uses a non-parametric kernel density estimator for modeling the
data-generating probability density, and finds clusters based on the modes of the estimated density.
Discriminative clustering learns a discriminative classifier for separating clusters, where class labels
are  also  treated  as  parameters  to  be  optimized  [96,97].  Dependence-maximization  clustering  determines 
cluster  assignments  so  that  their  dependence  on  input  data  is  maximized  [34,98,99]. 

Information-maximization clustering exhibited state-of-the-art performance [100,101], where
probabilistic classifiers such as a kernelized Gaussian classifier [100] and a kernel logistic regression
classifier [101] are learned so that mutual information between feature vectors and cluster assignments
is maximized in an unsupervised manner. A notable advantage of information-maximization clustering
is that classifier training is formulated as continuous optimization, which is substantially simpler
than discrete optimization of cluster assignments. Indeed, classifier training can be carried out in
a computationally efficient manner by a gradient method [100] or a quasi-Newton method [101].
Furthermore, a model selection strategy based on the information-maximization principle is also
provided [100]. Thus, kernel parameters can be systematically optimized in an unsupervised way.
However, the optimization problems of these clustering methods are non-convex and finding a good
local optimal solution is not straightforward in practice.

To  overcome  the  above  limitation,  an  SMI-based  clustering  method  called  SMI  clustering  (SMIC) 
was  proposed  [33].  Below,  we  review  SMIC. 

3.7.2.  Clustering  with  SMI 

Suppose that we are given $d$-dimensional i.i.d. feature vectors of size $n$, $\{x_i \mid x_i \in \mathbb{R}^{d}\}_{i=1}^{n}$, which are drawn independently from a distribution with density $p(x)$. The goal of clustering is to give cluster assignments, $\{y_i \mid y_i \in \{1, \ldots, c\}\}_{i=1}^{n}$, to the feature vectors $\{x_i\}_{i=1}^{n}$, where $c$ denotes the number of clusters; $c$ is assumed to be pre-fixed below. To solve the clustering problem, the information-maximization approach is taken [100,101]. That is, clustering is regarded as an unsupervised classification problem, and the class-posterior probability $p(y|x)$ is learned so that "information" between feature vector $x$ and cluster label $y$ is maximized.

As an information measure, SMI (Equation (1)) is adopted, which can be expressed as:

$$\mathrm{SMI} = \frac{1}{2} \int \sum_{y=1}^{c} p(y|x)\, p(x)\, \frac{p(y|x)}{p(y)}\, \mathrm{d}x - \frac{1}{2} \qquad (19)$$

Suppose that the class-prior probability $p(y)$ is set to a user-specified value $\pi_y$ for $y = 1, \ldots, c$, where $\pi_y > 0$ and $\sum_{y=1}^{c} \pi_y = 1$. Without loss of generality, $\{\pi_y\}_{y=1}^{c}$ are assumed to be sorted in the ascending order:

$$\pi_1 \le \cdots \le \pi_c$$

If $\{\pi_y\}_{y=1}^{c}$ is unknown, the uniform class-prior distribution may be adopted:

$$p(y) = \frac{1}{c} \quad \text{for } y = 1, \ldots, c$$

Substituting $\pi_y$ into $p(y)$, we can express Equation (19) as:

$$\mathrm{SMI} = \frac{1}{2} \int \sum_{y=1}^{c} \frac{1}{\pi_y}\, p(y|x)^{2}\, p(x)\, \mathrm{d}x - \frac{1}{2} \qquad (20)$$


Let us approximate the class-posterior probability $p(y|x)$ by the following kernel model:

$$q(y|x; \alpha) := \sum_{i=1}^{n} \alpha_{y,i} K(x, x_i) \qquad (21)$$

where $\alpha = (\alpha_{1,1}, \ldots, \alpha_{c,n})^\top$ is the parameter vector and $K(x, x')$ denotes a kernel function. A useful example of kernel functions is the local-scaling kernel [102] defined as:

$$K(x_i, x_j) = \begin{cases} \exp\left(-\dfrac{\|x_i - x_j\|^2}{2\sigma_i \sigma_j}\right) & \text{if } x_i \in \mathcal{N}_k(x_j) \text{ or } x_j \in \mathcal{N}_k(x_i) \\ 0 & \text{otherwise} \end{cases}$$

where $\mathcal{N}_k(x)$ denotes the set of $k$ nearest neighbors for $x$ ($k$ is the kernel parameter), $\sigma_i$ is a local scaling factor defined as $\sigma_i = \|x_i - x_i^{(k)}\|$, and $x_i^{(k)}$ is the $k$-th nearest neighbor of $x_i$. Note that we did not include the normalization term in Equation (21) because model outputs will be normalized later (see Equation (22)).
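A local-scaling kernel matrix of this kind can be computed with the NumPy sketch below. Two conventions are assumptions of the sketch and are not stated in the text: a point is not counted among its own nearest neighbors, and the diagonal is set to one. The sketch also assumes that all points are distinct, so that every local scaling factor is positive. The function name `local_scaling_kernel` and the toy data are illustrative.

```python
import numpy as np

def local_scaling_kernel(X, k):
    """Local-scaling kernel matrix for samples X of shape (n, dim).

    Assumptions of this sketch: self is excluded from the neighbor sets,
    the diagonal is set to one, and all points are distinct.
    """
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # squared distances
    dist = np.sqrt(sq)
    order = np.argsort(dist, axis=1)            # column 0 is the point itself
    knn = order[:, 1:k + 1]                     # indices of the k nearest neighbors
    sigma = dist[np.arange(n), knn[:, -1]]      # distance to the k-th neighbor
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), k), knn.ravel()] = True      # x_j in N_k(x_i)
    mask |= mask.T                              # "or" of the two neighbor relations
    K = np.where(mask, np.exp(-sq / (2.0 * np.outer(sigma, sigma))), 0.0)
    np.fill_diagonal(K, 1.0)
    return K

# Illustrative data only: two well-separated groups of three points.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [10.0, 10.0], [11.0, 10.0], [10.0, 11.0]])
K = local_scaling_kernel(X, k=2)
```

The resulting matrix is symmetric and sparse: entries between points that are not among each other's $k$ nearest neighbors are exactly zero, so in this toy example the two groups form separate blocks.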

Further approximating the expectation with respect to $p(x)$ included in Equation (20) by the empirical average of samples $\{x_i\}_{i=1}^{n}$, we arrive at the following SMI approximator:

$$\widehat{\mathrm{SMI}} := \frac{1}{2n} \sum_{y=1}^{c} \frac{1}{\pi_y}\, \alpha_y^\top K^{2} \alpha_y - \frac{1}{2}$$

where $\alpha_y := (\alpha_{y,1}, \ldots, \alpha_{y,n})^\top$ and $K_{i,j} := K(x_i, x_j)$.

For each cluster $y$, $\alpha_y^\top K^{2} \alpha_y$ is maximized under $\|\alpha_y\| = 1$. Since this is the Rayleigh quotient, the maximizer is given by the normalized principal eigenvector of $K$ [104]. To prevent all the solutions $\{\alpha_y\}_{y=1}^{c}$ from being reduced to the same principal eigenvector, their mutual orthogonality is imposed:

$$\alpha_y^\top \alpha_{y'} = 0 \quad \text{for } y \neq y'$$

Then the solutions are given by the normalized eigenvectors $\psi_1, \ldots, \psi_c$ associated with the eigenvalues $\lambda_1 \ge \cdots \ge \lambda_n \ge 0$ of $K$. Since the sign of $\psi_y$ is arbitrary, the sign is set as:

$$\tilde{\psi}_y = \psi_y \times \mathrm{sign}(\psi_y^\top 1_n)$$

where $\mathrm{sign}(\cdot)$ denotes the sign of a scalar and $1_n$ denotes the $n$-dimensional vector with all ones.

On the other hand, because

$$p(y) = \int p(y|x)\, p(x)\, \mathrm{d}x \approx \frac{1}{n} \sum_{i=1}^{n} q(y|x_i; \alpha) = \frac{1}{n}\, \alpha_y^\top K 1_n$$

and the class-prior probability $p(y)$ was set to $\pi_y$ for $y = 1, \ldots, c$, the following normalization condition is obtained:

$$\frac{1}{n}\, \alpha_y^\top K 1_n = \pi_y \qquad (22)$$


Furthermore, probability estimates should be non-negative, which can be achieved by rounding up negative outputs to zero.

By taking these normalization and non-negativity issues into account, cluster assignment $y_i$ for $x_i$ is determined as the maximizer of the approximation of $p(y|x_i)$:

$$y_i = \mathop{\mathrm{argmax}}_{y} \frac{[\max(0_n, K\tilde{\psi}_y)]_i}{\pi_y^{-1} \max(0_n, K\tilde{\psi}_y)^\top 1_n} = \mathop{\mathrm{argmax}}_{y} \frac{\pi_y [\max(0_n, \tilde{\psi}_y)]_i}{\max(0_n, \tilde{\psi}_y)^\top 1_n}$$

where the "max" operation for vectors is applied in the element-wise manner and $[\cdot]_i$ denotes the $i$-th element of a vector. Note that $K\tilde{\psi}_y = \lambda_y \tilde{\psi}_y$ was used in the above derivation. For out-of-sample prediction, cluster assignment $y'$ for new sample $x'$ may be obtained as:

$$y' := \mathop{\mathrm{argmax}}_{y} \frac{\pi_y \max\left(0, \sum_{i=1}^{n} K(x', x_i) [\tilde{\psi}_y]_i\right)}{\lambda_y \max(0_n, \tilde{\psi}_y)^\top 1_n}$$

The above method is called SMI-based clustering (SMIC) [33]. LSMI can be used for model selection of SMIC, i.e., LSMI is computed as a function of the kernel parameter included in $K(x, x')$ and the maximizer of LSMI is chosen as the most promising one. A MATLAB® implementation of SMIC is publicly available [103].
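The in-sample part of the procedure (eigendecomposition, sign correction, and the prior-weighted assignment) can be sketched in NumPy as below. The demonstration uses a plain Gaussian kernel with unit width instead of the local-scaling kernel, purely to keep the example short; the function name `smic` and the synthetic data are illustrative assumptions and do not reproduce the MATLAB® implementation [103].

```python
import numpy as np

def smic(K, c, prior=None):
    """Assign each of n samples to one of c clusters from an n x n kernel matrix K.

    prior holds the class-prior probabilities; it defaults to the uniform prior
    and is sorted in ascending order, as assumed in the text.
    """
    prior = np.full(c, 1.0 / c) if prior is None else np.sort(np.asarray(prior, dtype=float))
    eigvals, eigvecs = np.linalg.eigh(0.5 * (K + K.T))
    top = np.argsort(eigvals)[::-1][:c]          # c largest eigenvalues, descending
    Psi = eigvecs[:, top]                        # columns psi_1, ..., psi_c
    Psi = Psi * np.sign(Psi.sum(axis=0))         # sign(psi_y^T 1_n) correction
    P = np.maximum(0.0, Psi)                     # round negative outputs up to zero
    scores = prior * P / P.sum(axis=0)           # pi_y [max(0, psi_y)]_i / (max(0, psi_y)^T 1_n)
    return scores.argmax(axis=1)

# Illustrative data only: two well-separated blobs of different sizes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)), rng.normal(10.0, 0.3, (15, 2))])
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq / 2.0)                            # Gaussian kernel, width 1 (assumption)
labels = smic(K, 2)
```

No iterative optimization is involved: the cluster assignments follow directly from one eigendecomposition of $K$, which is the main practical appeal of the method compared with the non-convex alternatives discussed in the introduction.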

3.8.  Causal  Direction  Estimation 

Finally,  we  show  how  the  SMI  estimator  can  be  used  for  causal  direction  estimation. 

3.8.1.  Introduction 

Learning causality from data is one of the important challenges in the artificial intelligence, statistics,
and machine learning communities [105]. A traditional method of learning causal relationships from
observational data is based on the linear-dependence Gaussian-noise model [106]. However, the
linear-Gaussian assumption is too restrictive and may not be fulfilled in practice. Recently, non-Gaussianity
and non-linearity have been shown to be beneficial in causal inference, because they can break symmetry
between observed variables [107,108]. Since then, much attention has been paid to the discovery of
non-linear causal relationships through non-Gaussian noise models [109].

In the framework of non-linear non-Gaussian causal inference, the relation between a cause $X$ and
an effect $Y$ is assumed to be described by $Y = f(X) + E$, where $f$ is a non-linear function and $E$ is
non-Gaussian additive noise that is independent of the cause $X$. Under this additive noise assumption,
it was shown [108] that the causal direction between $X$ and $Y$ can be identified based on a hypothesis
test of whether the causal model $Y = f(X) + E$ or the alternative model $X = f'(Y) + E'$ fits the data
well; here, the goodness of fit is measured by independence between inputs and residuals (i.e., estimated
noise). In [108], the functions $f$ and $f'$ were learned by the Gaussian process (GP) regression [110], and
the independence between inputs and residuals was evaluated by the Hilbert-Schmidt independence
criterion (HSIC) [85].

However,  standard  regression  methods  such  as  GP  are  designed  to  handle  Gaussian  noise,  and  thus 
they  may  not  be  suited  for  discovering  causality  in  the  non-Gaussian  additive  noise  formulation.  To  cope 
with  this  problem,  an  alternative  regression  method  called  HSIC  regression  was  proposed  [109],  which 
learns  a  function  so  that  the  dependence  between  inputs  and  residuals  is  directly  minimized  based  on 
HSIC.  Through  experiments,  HSIC  regression  was  shown  to  outperform  the  GP-based  method  [109]. 
However,  the  choice  of  the  kernel  width  in  HSIC  regression  heavily  affects  the  sensitivity  of  the 
independence  measure,  and  systematic  model  selection  strategies  are  not  available.  Another  weakness 
of  HSIC  regression  is  that  the  kernel  width  of  the  regression  model  is  fixed  to  the  same  value  as  HSIC. 
This  crucially  limits  the  flexibility  of  function  approximation  in  HSIC  regression. 

To  overcome  the  above  weaknesses,  an  SMI-based  regression  method  for  causal  inference  called 
least-squares  independence  regression  (LSIR)  was  developed  [35].  Below,  we  review  LSIR. 

3.8.2. Dependence Minimizing Regression with SMI

Suppose random variables $X \in \mathbb{R}$ and $Y \in \mathbb{R}$ are connected by the following additive noise model [108]:

$$Y = f(X) + E$$

where $f : \mathbb{R} \to \mathbb{R}$ is some non-linear function and $E \in \mathbb{R}$ is a zero-mean random variable that is independent of $X$. The goal of dependence minimizing regression is, from i.i.d. paired samples $\{(x_i, y_i)\}_{i=1}^{n}$, to obtain a function $\hat{f}$ such that input $X$ and estimated additive noise $\hat{E} = Y - \hat{f}(X)$ are independent.

Let us employ a linear model for dependence minimizing regression:

$$f_\beta(x) = \sum_{l=1}^{m} \beta_l \psi_l(x) = \beta^\top \psi(x)$$

where $m$ is the number of basis functions, $\beta = (\beta_1, \ldots, \beta_m)^\top$ are regression parameters, and $\psi(x) = (\psi_1(x), \ldots, \psi_m(x))^\top$ are basis functions. In LSMI-based dependence minimization regression, the regression parameters $\beta$ are learned as:

$$\min_{\beta} \left[\mathrm{LSMI}(\{(x_i, e_i)\}_{i=1}^{n}) + \frac{\gamma}{2}\, \beta^\top \beta\right]$$

where $e_i = y_i - f_\beta(x_i)$ is the residual and $\gamma > 0$ is the regularization parameter to avoid overfitting. For regression parameter learning, a gradient descent method may be used:

$$\beta \longleftarrow \beta - t \left(\frac{\partial\, \mathrm{LSMI}}{\partial \beta} + \gamma \beta\right)$$
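where $t$ is the step size. The gradient $\partial\, \mathrm{LSMI} / \partial \beta$ can be approximately expressed as:

$$\frac{\partial\, \mathrm{LSMI}}{\partial \beta} \approx \sum_{\ell=1}^{n} \hat{\theta}_\ell \frac{\partial \hat{h}_\ell}{\partial \beta} - \frac{1}{2} \sum_{\ell,\ell'=1}^{n} \hat{\theta}_\ell \hat{\theta}_{\ell'} \frac{\partial \hat{H}_{\ell,\ell'}}{\partial \beta}$$

where

$$\frac{\partial \hat{h}_\ell}{\partial \beta} \approx \frac{1}{n\sigma^2} \sum_{i=1}^{n} \exp\left(-\frac{(x_i - x_\ell)^2 + (e_i - e_\ell)^2}{2\sigma^2}\right) (e_i - e_\ell)\, \psi(x_i)$$

$$\frac{\partial \hat{H}_{\ell,\ell'}}{\partial \beta} \approx \frac{1}{n^2\sigma^2} \sum_{i,j=1}^{n} \exp\left(-\frac{(x_i - x_\ell)^2 + (e_j - e_\ell)^2 + (x_i - x_{\ell'})^2 + (e_j - e_{\ell'})^2}{2\sigma^2}\right) \left[(e_j - e_\ell)\, \psi(x_j) + (e_j - e_{\ell'})\, \psi(x_j)\right]$$

Note that, in the above derivation, the dependence of the kernel centers $e_\ell$ on $\beta$ is ignored for simplicity. Although it is
possible to exactly compute the derivative in principle, this approximated expression is computationally
more efficient with good performance in practice.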

By taking into account the assumption that the mean of noise $E$ is zero, the final regressor is obtained as:

$$\hat{f}(x) = f_{\hat{\beta}}(x) + \frac{1}{n} \sum_{i=1}^{n} \left(y_i - f_{\hat{\beta}}(x_i)\right)$$


This  method  is  called  least-squares  independence  regression  (LSIR)  [35].  A  MATLAB® 
implementation  of  LSIR  is  publicly  available  [111]. 

3.8.3.  Causal  Direction  Inference  by  LSIR 

Our final goal is, given i.i.d. paired samples $\{(x_i, y_i)\}_{i=1}^{n}$, to determine whether $X$ causes $Y$ or vice versa under the additive noise assumption. To this end, we test whether the causal model $Y = f_Y(X) + E_Y$ or the alternative model $X = f_X(Y) + E_X$ fits the data well, where the goodness of fit is measured by independence between inputs and residuals (i.e., estimated noise). Independence of inputs and residuals may be decided in practice based on the permutation test procedure [47].

More specifically, LSIR is first run for $\{(x_i, y_i)\}_{i=1}^{n}$ as usual to obtain a regression function $\hat{f}$. This procedure also provides an SMI estimate, $\mathrm{LSMI}(\{(x_i, \hat{e}_i)\}_{i=1}^{n})$, where $\hat{e}_i = y_i - \hat{f}(x_i)$. Next, pairs of input and residual $\{(x_i, \hat{e}_i)\}_{i=1}^{n}$ are randomly permuted as $\{(x_i, \hat{e}_{\pi(i)})\}_{i=1}^{n}$, where $\pi(\cdot)$ is a randomly generated permutation function. Note that, in the permuted pairs, inputs and residuals are independent of each other because the random permutation breaks the dependency between $X$ and $\hat{E}$ (if it exists). Then, an SMI estimate for the permuted data, $\mathrm{LSMI}(\{(x_i, \hat{e}_{\pi(i)})\}_{i=1}^{n})$, is computed. This random permutation process is repeated many times, and the distribution of LSMI values under the null-hypothesis that $X$ and $\hat{E}$ are independent is constructed. Finally, the $p$-value is approximated by evaluating the relative ranking of LSMI computed from the original input-residual data, $\mathrm{LSMI}(\{(x_i, \hat{e}_i)\}_{i=1}^{n})$, over the distribution of LSMI values for randomly permuted data.
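The permutation procedure can be sketched generically in NumPy. The dependence statistic is passed in as a function; the squared correlation used in the demonstration is only a stand-in and is not LSMI. The "+1" correction in the $p$-value and all names are assumptions of this sketch.

```python
import numpy as np

def permutation_p_value(x, e, dependence, n_perm=200, seed=0):
    """Approximate p-value of the null hypothesis that x and e are independent.

    dependence(x, e) is any statistic that grows with dependence
    (LSMI in the text; a stand-in is used in the demonstration below).
    """
    rng = np.random.default_rng(seed)
    observed = dependence(x, e)
    null = np.array([dependence(x, e[rng.permutation(len(e))])
                     for _ in range(n_perm)])
    # Relative rank of the observed statistic among the permuted ones.
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

def squared_correlation(x, e):
    """Stand-in dependence measure for the demonstration; NOT LSMI."""
    return np.corrcoef(x, e)[0, 1] ** 2

# Illustrative data only: residuals that depend strongly on the input.
x = np.linspace(0.0, 1.0, 50)
e_dependent = x.copy()
p_dep = permutation_p_value(x, e_dependent, squared_correlation)
```

A small $p$-value indicates that the observed dependence is larger than what random pairing produces, i.e., that the residuals are not independent of the inputs and the fitted additive noise model is rejected.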

In order to decide the causal direction, the $p$-values $p_{X \to Y}$ and $p_{X \leftarrow Y}$ for both directions $X \to Y$ (i.e., $X$ causes $Y$) and $X \leftarrow Y$ (i.e., $Y$ causes $X$) are computed. Then, for a given significance level $\delta$, the causal direction is determined as follows:

• If $p_{X \to Y} > \delta$ and $p_{X \leftarrow Y} \le \delta$, the causal model $X \to Y$ is chosen.

• If $p_{X \leftarrow Y} > \delta$ and $p_{X \to Y} \le \delta$, the causal model $X \leftarrow Y$ is selected.

• If $p_{X \to Y}, p_{X \leftarrow Y} \le \delta$, perhaps there is no causal relation between $X$ and $Y$ or our modeling assumption is not correct (e.g., an unobserved confounding variable exists).

• If $p_{X \to Y}, p_{X \leftarrow Y} > \delta$, perhaps our modeling assumption is not correct or it is not possible to identify a causal direction (i.e., $X$, $Y$, and $E$ are Gaussian random variables).

When we have prior knowledge that there exists a causal relation between $X$ and $Y$ but the causal direction is unknown, the values of $p_{X \to Y}$ and $p_{X \leftarrow Y}$ may be simply compared for determining the causal direction as follows:

• If $p_{X \to Y} > p_{X \leftarrow Y}$, we conclude that $X$ causes $Y$.

• Otherwise, we conclude that $Y$ causes $X$.

This simplified procedure does not include the computationally expensive permutation process and thus it is computationally very efficient.
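The two decision rules above can be written compactly as follows. The function names, the returned strings, and the default significance level of 0.05 are assumptions made for illustration.

```python
def causal_direction(p_xy, p_yx, delta=0.05):
    """Decision rule based on the p-values for X -> Y (p_xy) and X <- Y (p_yx)."""
    if p_xy > delta and p_yx <= delta:
        return "X -> Y"
    if p_yx > delta and p_xy <= delta:
        return "X <- Y"
    if p_xy <= delta and p_yx <= delta:
        return "no causal relation or incorrect model"
    return "incorrect model or unidentifiable direction"

def causal_direction_simple(p_xy, p_yx):
    """Simplified rule when a causal relation is known to exist."""
    return "X -> Y" if p_xy > p_yx else "X <- Y"
```

For example, `causal_direction(0.40, 0.01)` selects the model $X \to Y$, since independence between inputs and residuals is not rejected in that direction but is rejected in the reverse one.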

4. Conclusions

In this article, we reviewed recent developments in the estimation of squared-loss mutual information
(SMI) and its application to machine learning. The key idea for accurately estimating SMI is to
directly estimate the ratio of probability densities without separately estimating each density. A notable
advantage of the SMI estimator called least-squares mutual information (LSMI) [19] is that it can
be computed analytically in a computationally more efficient and numerically more stable way than
estimators of ordinary MI.

We have introduced SMI as a measure of statistical independence between random variables. On the
other hand, ordinary MI has a rich information-theoretic interpretation via entropies. Thus, it is important
to investigate an information-theoretic meaning of SMI, which currently remains an open question.

Various  methods  of  direct  density-ratio  estimation  have  been  explored  so  far  [16,18],  and  such 
density  ratio  estimators  were  shown  to  be  applicable  to  an  even  wider  class  of  machine  learning 
tasks  beyond  SMI  estimation,  such  as  non-stationarity  adaptation  [112],  outlier  detection  [113], 
change  detection  [114,115],  class-balance  estimation  [116],  two-sample  homogeneity  testing  [117,118], 
probabilistic  classification  [119,120],  and  conditional  density  estimation  [121]. 

Improving  the  accuracy  of  density  ratio  estimation  contributes  to  enhancing  the  performance  of  the 
above  machine  learning  solutions.  Recent  advances  in  this  line  of  research  include  dimensionality 
reduction  for  density  ratio  estimation  [122-124],  a  unified  statistical  framework  of  density  ratio 
estimation  [18],  and  extensions  to  relative  density  ratios  [125]  and  density  differences  [126].  Further 
improving  the  accuracy  and  computational  efficiency  and  exploring  new  application  areas  are  important 
future  directions  to  pursue. 

More  program  codes  are  publicly  available  [127]. 
Abstract

Divergence estimators based on direct approximation of density-ratios without going
through separate approximation of numerator and denominator densities have
been successfully applied to machine learning tasks that involve distribution comparison
such as outlier detection, transfer learning, and two-sample homogeneity
test. However, since density-ratio functions often possess high fluctuation, divergence
estimation is still a challenging task in practice. In this paper, we propose to
use relative divergences for distribution comparison, which involves approximation
of relative density-ratios. Since relative density-ratios are always smoother than
corresponding ordinary density-ratios, our proposed method is favorable in terms
of the non-parametric convergence speed. Furthermore, we show that the proposed
divergence estimator has asymptotic variance independent of the model complexity
under a parametric setup, implying that the proposed estimator hardly overfits even
with complex models. Through experiments, we demonstrate the usefulness of the
proposed approach.
1  Introduction

Comparing probability distributions is a fundamental task in statistical data processing. It
can be used for, e.g., outlier detection (Smola et al., 2009; Hido et al., 2011), two-sample
homogeneity test (Gretton et al., 2007; Sugiyama et al., 2011), and transfer learning
(Shimodaira, 2000; Sugiyama et al., 2007).

A  standard  approach  to  comparing  probability  densities  p(x)  and  p'(x)  would  be  to 
estimate  a  divergence  from  p(x)  to  p'(x),  such  as  the  Kullback-Leibler  (KL)  divergence 
(Kullback and Leibler, 1951):

$$\mathrm{KL}[p(x), p'(x)] := \int p(x) \log \left( \frac{p(x)}{p'(x)} \right) dx.$$


A naive way to estimate the KL divergence is to separately approximate the densities p(x)
and p'(x) from data and plug the estimated densities into the above definition. However,
since density estimation is known to be a hard task (Vapnik, 1998), this approach does not
work well unless a good parametric model is available. Recently, a divergence estimation
approach which directly approximates the density ratio,

$$r(x) := \frac{p(x)}{p'(x)},$$

without going through separate approximation of densities p(x) and p'(x) has been proposed
(Sugiyama et al., 2008; Nguyen et al., 2010). Such density-ratio approximation
methods were proved to achieve the optimal non-parametric convergence rate in the minimax
sense.

However, the KL divergence estimation via density-ratio approximation is computationally
rather expensive due to the non-linearity introduced by the 'log' term. To cope
with this problem, another divergence called the Pearson (PE) divergence (Pearson, 1900)
is useful. The PE divergence from p(x) to p'(x) is defined as

$$\mathrm{PE}[p(x), p'(x)] := \frac{1}{2} \int \left( \frac{p(x)}{p'(x)} - 1 \right)^2 p'(x)\, dx.$$

The PE divergence is a squared-loss variant of the KL divergence, and they both belong to
the class of the Ali-Silvey-Csiszár divergences (also known as the f-divergences;
see Ali and Silvey, 1966; Csiszár, 1967). Thus, the PE and KL divergences share similar
properties, e.g., they are non-negative and vanish if and only if p(x) = p'(x).

Similarly to the KL divergence estimation, the PE divergence can also be accurately
estimated based on density-ratio approximation (Kanamori et al., 2009): the density-ratio
approximator called unconstrained least-squares importance fitting (uLSIF) gives the PE
divergence estimator analytically, which can be computed just by solving a system of
linear equations. The practical usefulness of the uLSIF-based PE divergence estimator
was demonstrated in various applications such as outlier detection (Hido et al., 2011),
two-sample homogeneity test (Sugiyama et al., 2011), and dimensionality reduction (Suzuki
and Sugiyama, 2010).
In this paper, we first establish the non-parametric convergence rate of the uLSIF-based
PE divergence estimator, which elucidates its superior theoretical properties. However,
it also reveals that its convergence rate is actually governed by the 'sup'-norm of
the true density-ratio function: $\max_x r(x)$. This implies that, in the region where the
denominator density p'(x) takes small values, the density ratio r(x) = p(x)/p'(x) tends
to take large values and therefore the overall convergence speed becomes slow. More critically,
density ratios can even diverge to infinity under a rather simple setting, e.g., when
the ratio of two Gaussian functions is considered (Cortes et al., 2010). This makes the
paradigm of divergence estimation based on density-ratio approximation unreliable.

In order to overcome this fundamental problem, we propose an alternative approach to
distribution comparison called $\alpha$-relative divergence estimation. In the proposed approach,
we estimate the quantity called the $\alpha$-relative divergence, which is the divergence from
p(x) to the $\alpha$-mixture density $\alpha p(x) + (1 - \alpha) p'(x)$ for $0 \le \alpha < 1$. For example, the
$\alpha$-relative PE divergence is given by

$$\mathrm{PE}_\alpha[p(x), p'(x)] := \mathrm{PE}[p(x), \alpha p(x) + (1 - \alpha) p'(x)]
= \frac{1}{2} \int \left( \frac{p(x)}{\alpha p(x) + (1 - \alpha) p'(x)} - 1 \right)^2 \left( \alpha p(x) + (1 - \alpha) p'(x) \right) dx.$$

We estimate the $\alpha$-relative divergence by direct approximation of the $\alpha$-relative density-ratio:

$$r_\alpha(x) := \frac{p(x)}{\alpha p(x) + (1 - \alpha) p'(x)}.$$

A notable advantage of this approach is that the $\alpha$-relative density-ratio is always
bounded above by $1/\alpha$ when $\alpha > 0$, even when the ordinary density-ratio is unbounded.
Based on this feature, we theoretically show that the $\alpha$-relative PE divergence estimator
based on $\alpha$-relative density-ratio approximation is more favorable than the ordinary
density-ratio approach in terms of the non-parametric convergence speed.
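
The boundedness claim can be checked in one line. The following short derivation assumes only that the densities are non-negative and considers points with p(x) > 0 (where p(x) = 0 the relative ratio is simply 0):

```latex
r_\alpha(x)
  = \frac{p(x)}{\alpha p(x) + (1 - \alpha) p'(x)}
  \le \frac{p(x)}{\alpha p(x)}
  = \frac{1}{\alpha}
  \qquad (0 < \alpha < 1),
% because (1 - \alpha) p'(x) \ge 0 only enlarges the denominator;
% equality holds exactly where p'(x) = 0.
```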

We further prove that, under a correctly-specified parametric setup, the asymptotic
variance of our $\alpha$-relative PE divergence estimator does not depend on the model complexity.
This means that the proposed $\alpha$-relative PE divergence estimator hardly overfits
even with complex models.

Through extensive experiments on outlier detection, two-sample homogeneity test, and
transfer learning, we demonstrate that our proposed $\alpha$-relative PE divergence estimator
compares favorably with alternative approaches.

The  rest  of  this  paper  is  structured  as  follows.  In  Section  2,  our  proposed  relative  PE 
divergence  estimator  is  described.  In  Section  3,  we  provide  non-parametric  analysis  of  the 
convergence  rate  and  parametric  analysis  of  the  variance  of  the  proposed  PE  divergence 
estimator.  In  Section  4,  we  experimentally  evaluate  the  performance  of  the  proposed 
method  on  various  tasks.  Finally,  in  Section  5,  we  conclude  the  paper  by  summarizing 
our  contributions  and  describing  future  prospects. 


2  Estimation of Relative Pearson Divergence via Least-Squares Relative Density-Ratio Approximation

In  this  section,  we  propose  an  estimator  of  the  relative  Pearson  (PE)  divergence  based  on 
least-squares  relative  density-ratio  approximation. 


2.1  Problem  Formulation 

Suppose we are given independent and identically distributed (i.i.d.) samples $\{x_i\}_{i=1}^{n}$
from a d-dimensional distribution P with density p(x) and i.i.d. samples $\{x'_j\}_{j=1}^{n'}$ from
another d-dimensional distribution P' with density p'(x):

$$\{x_i\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} P, \qquad \{x'_j\}_{j=1}^{n'} \overset{\text{i.i.d.}}{\sim} P'.$$

The goal of this paper is to compare the two underlying distributions P and P' only using
the two sets of samples $\{x_i\}_{i=1}^{n}$ and $\{x'_j\}_{j=1}^{n'}$.

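For $0 \le \alpha < 1$, let $q_\alpha(x)$ be the $\alpha$-mixture density of p(x) and p'(x):

$$q_\alpha(x) := \alpha p(x) + (1 - \alpha) p'(x).$$

Let $r_\alpha(x)$ be the $\alpha$-relative density-ratio of p(x) and p'(x):

$$r_\alpha(x) := \frac{p(x)}{\alpha p(x) + (1 - \alpha) p'(x)} = \frac{p(x)}{q_\alpha(x)}. \tag{1}$$

We define the $\alpha$-relative PE divergence from p(x) to p'(x) as

$$\mathrm{PE}_\alpha := \frac{1}{2} \mathbb{E}_{q_\alpha(x)} \left[ (r_\alpha(x) - 1)^2 \right], \tag{2}$$

where $\mathbb{E}_{p(x)}[f(x)]$ denotes the expectation of f(x) under p(x):

$$\mathbb{E}_{p(x)}[f(x)] = \int f(x) p(x)\, dx.$$
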

When $\alpha = 0$, $\mathrm{PE}_\alpha$ is reduced to the ordinary PE divergence. Thus, the $\alpha$-relative PE
divergence can be regarded as a 'smoothed' extension of the ordinary PE divergence.

Below, we give a method for estimating the $\alpha$-relative PE divergence based on the
approximation of the $\alpha$-relative density-ratio.
2.2  Direct Approximation of α-Relative Density-Ratios

Here, we describe a method for approximating the $\alpha$-relative density-ratio (1).

Let us model the $\alpha$-relative density-ratio $r_\alpha(x)$ by the following kernel model:

$$g(x; \theta) := \sum_{\ell=1}^{n} \theta_\ell K(x, x_\ell),$$

where $\theta := (\theta_1, \ldots, \theta_n)^\top$ are parameters to be learned from data samples, $^\top$ denotes
the transpose of a matrix or a vector, and K(x, x') is a kernel basis function. In the
experiments, we use the Gaussian kernel:

$$K(x, x') = \exp \left( - \frac{\| x - x' \|^2}{2 \sigma^2} \right), \tag{3}$$

where $\sigma$ (> 0) is the kernel width.

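The parameters $\theta$ in the model $g(x; \theta)$ are determined so that the following expected
squared-error J is minimized:

$$J(\theta) := \frac{1}{2} \mathbb{E}_{q_\alpha(x)} \left[ (g(x; \theta) - r_\alpha(x))^2 \right]
= \frac{\alpha}{2} \mathbb{E}_{p(x)} \left[ g(x; \theta)^2 \right] + \frac{1 - \alpha}{2} \mathbb{E}_{p'(x)} \left[ g(x; \theta)^2 \right] - \mathbb{E}_{p(x)} \left[ g(x; \theta) \right] + \mathrm{Const.},$$

where we used $r_\alpha(x) q_\alpha(x) = p(x)$ in the third term. Approximating the expectations by
empirical averages, we obtain the following optimization problem: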

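$$\widehat{\theta} := \operatorname*{argmin}_{\theta \in \mathbb{R}^n} \left[ \frac{1}{2} \theta^\top \widehat{H} \theta - \widehat{h}^\top \theta + \frac{\lambda}{2} \theta^\top \theta \right], \tag{4}$$

where a penalty term $\lambda \theta^\top \theta / 2$ is included for regularization purposes, and $\lambda$ ($\ge 0$) denotes
the regularization parameter. $\widehat{H}$ is the $n \times n$ matrix with the $(\ell, \ell')$-th element

$$\widehat{H}_{\ell,\ell'} := \frac{\alpha}{n} \sum_{i=1}^{n} K(x_i, x_\ell) K(x_i, x_{\ell'}) + \frac{1 - \alpha}{n'} \sum_{j=1}^{n'} K(x'_j, x_\ell) K(x'_j, x_{\ell'}), \tag{5}$$

and $\widehat{h}$ is the n-dimensional vector with the $\ell$-th element

$$\widehat{h}_\ell := \frac{1}{n} \sum_{i=1}^{n} K(x_i, x_\ell).$$

It is easy to confirm that the solution of Eq.(4) can be analytically obtained as

$$\widehat{\theta} = (\widehat{H} + \lambda I_n)^{-1} \widehat{h},$$

where $I_n$ denotes the n-dimensional identity matrix. Finally, a density-ratio estimator is
given as

$$\widehat{r}_\alpha(x) := g(x; \widehat{\theta}) = \sum_{\ell=1}^{n} \widehat{\theta}_\ell K(x, x_\ell). \tag{6}$$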

When $\alpha = 0$, the above method is reduced to a direct density-ratio estimator called
unconstrained least-squares importance fitting (uLSIF; Kanamori et al., 2009). Thus, the
above method can be regarded as an extension of uLSIF to the $\alpha$-relative density-ratio.
For this reason, we refer to our method as relative uLSIF (RuLSIF).

The performance of RuLSIF depends on the choice of the kernel function (the kernel
width $\sigma$ in the case of the Gaussian kernel) and the regularization parameter $\lambda$. Model
selection of RuLSIF is possible based on cross-validation with respect to the squared-error
criterion J, in the same way as the original uLSIF (Kanamori et al., 2009).
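
The closed-form procedure of this subsection can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, all n samples from P are used as kernel centers (as in the kernel model above), and the kernel width and regularization parameter are fixed by hand instead of being chosen by cross-validation.

```python
import numpy as np


def gaussian_kernel(a, b, sigma):
    """Matrix of K(a_i, b_l) = exp(-||a_i - b_l||^2 / (2 sigma^2))."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def rulsif_fit(x, x_prime, alpha, sigma, lam):
    """Solve (H + lam * I) theta = h, with H and h as in Eq.(5) and below it."""
    n, n_prime = len(x), len(x_prime)
    K = gaussian_kernel(x, x, sigma)              # K[i, l] = K(x_i, x_l)
    K_prime = gaussian_kernel(x_prime, x, sigma)  # K'[j, l] = K(x'_j, x_l)
    H = alpha / n * K.T @ K + (1.0 - alpha) / n_prime * K_prime.T @ K_prime
    h = K.mean(axis=0)
    return np.linalg.solve(H + lam * np.eye(n), h)


def rulsif_ratio(x_query, x, theta, sigma):
    """Evaluate the fitted relative density-ratio model at x_query."""
    return gaussian_kernel(x_query, x, sigma) @ theta
```

A possible usage, with arbitrary illustrative settings: draw `x` of shape `(n, d)` and `x_prime` of shape `(n', d)`, call `theta = rulsif_fit(x, x_prime, alpha=0.5, sigma=1.0, lam=0.1)`, and evaluate `rulsif_ratio(x_prime, x, theta, 1.0)`. Solving the linear system with `np.linalg.solve` avoids forming the matrix inverse explicitly.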


2.3  α-Relative PE Divergence Estimation Based on RuLSIF


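Using an estimator of the $\alpha$-relative density-ratio $r_\alpha(x)$, we can construct estimators of
the $\alpha$-relative PE divergence (2). After a few lines of calculation, we can show that the
$\alpha$-relative PE divergence (2) is equivalently expressed as

$$\mathrm{PE}_\alpha = - \frac{\alpha}{2} \mathbb{E}_{p(x)} \left[ r_\alpha(x)^2 \right] - \frac{1 - \alpha}{2} \mathbb{E}_{p'(x)} \left[ r_\alpha(x)^2 \right] + \mathbb{E}_{p(x)} \left[ r_\alpha(x) \right] - \frac{1}{2}
= \frac{1}{2} \mathbb{E}_{p(x)} \left[ r_\alpha(x) \right] - \frac{1}{2}.$$

Note that the first line can also be obtained via Legendre-Fenchel convex duality of the
divergence functional (Rockafellar, 1970).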
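
For readers who want the intermediate steps, the calculation can be written out as follows. It uses only the definitions (1) and (2) and the identity $r_\alpha(x) q_\alpha(x) = p(x)$:

```latex
\begin{aligned}
\mathrm{PE}_\alpha
  &= \tfrac{1}{2}\,\mathbb{E}_{q_\alpha(x)}\!\left[(r_\alpha(x) - 1)^2\right]
   = \tfrac{1}{2}\,\mathbb{E}_{q_\alpha(x)}\!\left[r_\alpha(x)^2\right]
     - \mathbb{E}_{q_\alpha(x)}\!\left[r_\alpha(x)\right] + \tfrac{1}{2}, \\
\mathbb{E}_{q_\alpha(x)}\!\left[r_\alpha(x)\right]
  &= \int p(x)\,dx = 1, \qquad
\mathbb{E}_{q_\alpha(x)}\!\left[r_\alpha(x)^2\right]
   = \int r_\alpha(x)\,p(x)\,dx
   = \mathbb{E}_{p(x)}\!\left[r_\alpha(x)\right], \\
\Rightarrow\ \mathrm{PE}_\alpha
  &= \tfrac{1}{2}\,\mathbb{E}_{p(x)}\!\left[r_\alpha(x)\right] - \tfrac{1}{2}
   = -\tfrac{1}{2}\,\mathbb{E}_{q_\alpha(x)}\!\left[r_\alpha(x)^2\right]
     + \mathbb{E}_{p(x)}\!\left[r_\alpha(x)\right] - \tfrac{1}{2}, \\
\mathbb{E}_{q_\alpha(x)}\!\left[r_\alpha(x)^2\right]
  &= \alpha\,\mathbb{E}_{p(x)}\!\left[r_\alpha(x)^2\right]
     + (1 - \alpha)\,\mathbb{E}_{p'(x)}\!\left[r_\alpha(x)^2\right].
\end{aligned}
```

Substituting the last line into the third gives the four-term expression, and the third line itself contains the two-term expression.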

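Based on these expressions, we consider the following two estimators:

$$\widehat{\mathrm{PE}}_\alpha := - \frac{\alpha}{2n} \sum_{i=1}^{n} \widehat{r}_\alpha(x_i)^2 - \frac{1 - \alpha}{2n'} \sum_{j=1}^{n'} \widehat{r}_\alpha(x'_j)^2 + \frac{1}{n} \sum_{i=1}^{n} \widehat{r}_\alpha(x_i) - \frac{1}{2}, \tag{7}$$

$$\widetilde{\mathrm{PE}}_\alpha := \frac{1}{2n} \sum_{i=1}^{n} \widehat{r}_\alpha(x_i) - \frac{1}{2}. \tag{8}$$
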

We note that the $\alpha$-relative PE divergence (2) can have further different expressions than
the above ones, and corresponding estimators can also be constructed similarly. However,
the above two expressions will be particularly useful: the first estimator $\widehat{\mathrm{PE}}_\alpha$ has superior
theoretical properties (see Section 3) and the second one $\widetilde{\mathrm{PE}}_\alpha$ is simple to compute.
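
Given estimated relative density-ratio values at the two sample sets, the estimators (7) and (8) reduce to sample averages. The sketch below is illustrative only; the function names `pe_eq7` and `pe_eq8` are hypothetical, and the inputs are assumed to be one-dimensional arrays of ratio values evaluated at the samples from P and from P'.

```python
import numpy as np


def pe_eq7(r_x, r_x_prime, alpha):
    """Estimator (7): built from the four-term expression of PE_alpha."""
    return float(
        -alpha / 2.0 * np.mean(r_x ** 2)
        - (1.0 - alpha) / 2.0 * np.mean(r_x_prime ** 2)
        + np.mean(r_x)
        - 0.5
    )


def pe_eq8(r_x):
    """Estimator (8): built from the two-term expression of PE_alpha."""
    return float(0.5 * np.mean(r_x) - 0.5)
```

When the ratio values are identically 1 (the two distributions coincide), both functions return 0, matching the fact that the divergence vanishes in that case.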


2.4  Illustrative  Examples 

Here, we numerically illustrate the behavior of RuLSIF (6) using toy datasets. Let the
numerator distribution be $P = N(0, 1)$, where $N(\mu, \sigma^2)$ denotes the normal distribution
with mean $\mu$ and variance $\sigma^2$. The denominator distribution P' is set as follows:

(a) $P' = N(0, 1)$: P and P' are the same.

(b) $P' = N(0, 0.6)$: P' has smaller standard deviation than P.

(c) $P' = N(0, 2)$: P' has larger standard deviation than P.

(d) $P' = N(0.5, 1)$: P and P' have different means.

(e) $P' = 0.95 N(0, 1) + 0.05 N(3, 1)$: P' contains an additional component to P.

We draw $n = n' = 300$ samples from the above densities, and compute RuLSIF for $\alpha = 0$,
0.5, and 0.95.

Figure 1 shows the true densities, true density-ratios, and their estimates by RuLSIF.
As can be seen from the graphs, the profiles of the true $\alpha$-relative density-ratios get
smoother as $\alpha$ increases. In particular, in the datasets (b) and (d), the true density-ratios
for $\alpha = 0$ diverge to infinity, while those for $\alpha = 0.5$ and 0.95 are bounded (by $1/\alpha$).
Overall, as $\alpha$ gets large, the estimation quality of RuLSIF tends to be improved since the
complexity of true density-ratio functions is reduced.

Note that, in the dataset (a) where p(x) = p'(x), the true density-ratio $r_\alpha(x)$ does
not depend on $\alpha$ since $r_\alpha(x) = 1$ for any $\alpha$. However, the estimated density-ratios still
depend on $\alpha$ through the matrix $\widehat{H}$ (see Eq.(5)).
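
The behavior of the true relative density-ratios can be reproduced numerically. The sketch below evaluates them on a grid for dataset (b); it assumes that the second argument of N(·, ·) is the variance, as stated in the definition above, and the helper names and the grid range are arbitrary illustrative choices.

```python
import numpy as np


def normal_pdf(x, mean, var):
    """Density of the normal distribution with the given mean and variance."""
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)


def relative_ratio(p, p_prime, alpha):
    """True alpha-relative density-ratio p / (alpha p + (1 - alpha) p')."""
    return p / (alpha * p + (1.0 - alpha) * p_prime)


x = np.linspace(-5.0, 5.0, 1001)
p = normal_pdf(x, 0.0, 1.0)        # numerator density, P = N(0, 1)
p_prime = normal_pdf(x, 0.0, 0.6)  # denominator density of dataset (b)

for alpha in (0.0, 0.5, 0.95):
    r = relative_ratio(p, p_prime, alpha)
    print(f"alpha = {alpha}: largest ratio on the grid = {r.max():.3f}")
```

On this grid the ordinary ratio (alpha = 0) keeps growing toward the tails, whereas the relative ratios stay below 1/alpha.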


3  Theoretical  Analysis 


In this section, we analyze theoretical properties of the proposed PE divergence estimators.
More specifically, we provide non-parametric analysis of the convergence rate in
Section 3.1, and parametric analysis of the estimation variance in Section 3.2. Since our
theoretical analysis is highly technical, we focus on explaining practical insights we can
gain from the theoretical results here; we describe all the mathematical details of the
non-parametric convergence-rate analysis in Appendix A and the parametric variance analysis
in Appendix B.

For theoretical analysis, let us consider a rather abstract form of our relative density-ratio
estimator described as

$$\widehat{r}_\alpha := \operatorname*{argmin}_{g \in \mathcal{G}} \left[ \frac{\alpha}{2n} \sum_{i=1}^{n} g(x_i)^2 + \frac{1 - \alpha}{2n'} \sum_{j=1}^{n'} g(x'_j)^2 - \frac{1}{n} \sum_{i=1}^{n} g(x_i) + \frac{\lambda}{2} R(g)^2 \right], \tag{9}$$

where $\mathcal{G}$ is some function space (i.e., a statistical model) and $R(\cdot)$ is some regularization
functional.

3.1  Non-Parametric  Convergence  Analysis 

First, we elucidate the non-parametric convergence rate of the proposed PE estimators.
Here, we practically regard the function space $\mathcal{G}$ as an infinite-dimensional reproducing
kernel Hilbert space (RKHS; Aronszajn, 1950) such as the Gaussian kernel space, and
$R(\cdot)$ as the associated RKHS norm.
[Figure 1 (plots not reproduced). Panels: (a) P' = N(0, 1): P and P' are the same; (b) P' = N(0, 0.6): P' has smaller standard deviation than P; (c) P' = N(0, 2): P' has larger standard deviation than P; (d) P' = N(0.5, 1): P and P' have different means; (e) P' = 0.95 N(0, 1) + 0.05 N(3, 1): P' contains an additional component to P.]

Figure 1: Illustrative examples of density-ratio approximation by RuLSIF. From left to
right: true densities (P = N(0, 1)), true density-ratios, and their estimates for α = 0, 0.5,
and 0.95.
3.1.1  Theoretical Results

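Let us represent the complexity of the function space $\mathcal{G}$ by $\gamma$ ($0 < \gamma < 2$); the larger $\gamma$ is,
the more complex the function class $\mathcal{G}$ is (see Appendix A for its precise definition). We
analyze the convergence rate of our PE divergence estimators as $\bar{n} := \min(n, n')$ tends to
infinity for $\lambda = \lambda_{\bar{n}}$ under

$$\lambda_{\bar{n}} \to o(1) \quad \text{and} \quad \lambda_{\bar{n}}^{-1} = o(\bar{n}^{2/(2+\gamma)}).$$

The first condition means that $\lambda_{\bar{n}}$ tends to zero, but the second condition means that its
shrinking speed should not be too fast.

Under several technical assumptions detailed in Appendix A, we have the following
asymptotic convergence results for the two PE divergence estimators $\widehat{\mathrm{PE}}_\alpha$ (7) and $\widetilde{\mathrm{PE}}_\alpha$
(8):

$$\widehat{\mathrm{PE}}_\alpha - \mathrm{PE}_\alpha = O_p \left( \bar{n}^{-1/2} c \| r_\alpha \|_\infty + \lambda_{\bar{n}} \max(1, R(r_\alpha)^2) \right), \tag{10}$$

and

$$\widetilde{\mathrm{PE}}_\alpha - \mathrm{PE}_\alpha = O_p \Big( \lambda_{\bar{n}}^{1/2} \| r_\alpha \|_\infty^{1/2} \max\{1, R(r_\alpha)\}
+ \lambda_{\bar{n}} \max\{1, \| r_\alpha \|_\infty^{(1-\gamma/2)/2}, R(r_\alpha) \| r_\alpha \|_\infty^{(1-\gamma/2)/2}, R(r_\alpha)\} \Big), \tag{11}$$

where $O_p$ denotes the asymptotic order in probability,

$$c := (1 + \alpha) \sqrt{\mathbb{V}_{p(x)}[r_\alpha(x)]} + (1 - \alpha) \sqrt{\mathbb{V}_{p'(x)}[r_\alpha(x)]}, \tag{12}$$

and $\mathbb{V}_{p(x)}[f(x)]$ denotes the variance of f(x) under p(x):

$$\mathbb{V}_{p(x)}[f(x)] = \int \left( f(x) - \int f(x) p(x)\, dx \right)^2 p(x)\, dx.$$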
3.1.2  Interpretation 

In both Eq.(10) and Eq.(11), the coefficients of the leading terms (i.e., the first terms) of
the asymptotic convergence rates become smaller as $\| r_\alpha \|_\infty$ gets smaller. Since

$$\| r_\alpha \|_\infty = \left\| \left( \alpha + (1 - \alpha) / r(x) \right)^{-1} \right\|_\infty < \frac{1}{\alpha} \quad \text{for } \alpha > 0,$$

larger $\alpha$ would be more preferable in terms of the asymptotic approximation error. Note
that when $\alpha = 0$, $\| r_\alpha \|_\infty$ can tend to infinity even under a simple setting that the ratio of
two Gaussian functions is considered (Cortes et al., 2010, see also the numerical examples
in Section 2.4 of this paper). Thus, our proposed approach of estimating the $\alpha$-relative
PE divergence (with $\alpha > 0$) would be more advantageous than the naive approach of
estimating the plain PE divergence (which corresponds to $\alpha = 0$) in terms of the
non-parametric convergence rate.
The above results also show that $\widehat{\mathrm{PE}}_\alpha$ and $\widetilde{\mathrm{PE}}_\alpha$ have different asymptotic convergence
rates. The leading term in Eq.(10) is of order $\bar{n}^{-1/2}$, while the leading term in Eq.(11)
is of order $\lambda_{\bar{n}}^{1/2}$, which is slightly slower (depending on the complexity $\gamma$) than $\bar{n}^{-1/2}$.
Thus, $\widehat{\mathrm{PE}}_\alpha$ would be more accurate than $\widetilde{\mathrm{PE}}_\alpha$ in large sample cases. Furthermore, when
p(x) = p'(x), $\mathbb{V}_{p(x)}[r_\alpha(x)] = 0$ holds and thus c = 0 holds (see Eq.(12)). Then the leading
term in Eq.(10) vanishes and therefore $\widehat{\mathrm{PE}}_\alpha$ has the even faster convergence rate of order
$\lambda_{\bar{n}}$, which is slightly slower (depending on the complexity $\gamma$) than $\bar{n}^{-1}$. Similarly, if $\alpha$ is
close to 1, $r_\alpha(x) \approx 1$ and thus $c \approx 0$ holds.

When $\bar{n}$ is not large enough to be able to neglect the terms of $o(\bar{n}^{-1/2})$, the terms of
$O(\lambda_{\bar{n}})$ matter. If $\| r_\alpha \|_\infty$ and $R(r_\alpha)$ are large (this can happen, e.g., when $\alpha$ is close to
0), the coefficient of the $O(\lambda_{\bar{n}})$-term in Eq.(10) can be larger than that in Eq.(11). Then
$\widetilde{\mathrm{PE}}_\alpha$ would be more favorable than $\widehat{\mathrm{PE}}_\alpha$ in terms of the approximation accuracy.

3.1.3  Numerical  Illustration 

Let us numerically investigate the above interpretation using the same artificial dataset
as Section 2.4.

Figure 2 shows the mean and standard deviation of $\widehat{\mathrm{PE}}_\alpha$ and $\widetilde{\mathrm{PE}}_\alpha$ over 100 runs for
$\alpha = 0$, 0.5, and 0.95, as functions of n (= n' in this experiment). The true $\mathrm{PE}_\alpha$ (which
was numerically computed) is also plotted in the graphs. The graphs show that both the
estimators $\widehat{\mathrm{PE}}_\alpha$ and $\widetilde{\mathrm{PE}}_\alpha$ approach the true $\mathrm{PE}_\alpha$ as the number of samples increases, and
the approximation error tends to be smaller if $\alpha$ is larger.

When $\alpha$ is large, $\widehat{\mathrm{PE}}_\alpha$ tends to perform slightly better than $\widetilde{\mathrm{PE}}_\alpha$. On the other hand,
when $\alpha$ is small and the number of samples is small, $\widetilde{\mathrm{PE}}_\alpha$ slightly compares favorably
with $\widehat{\mathrm{PE}}_\alpha$. Overall, these numerical results agree well with our theory.

3.2  Parametric  Variance  Analysis 

Next, we analyze the asymptotic variance of the PE divergence estimator $\widehat{\mathrm{PE}}_\alpha$ (7) under
a parametric setup.

3.2.1  Theoretical  Results 

As the function space $\mathcal{G}$ in Eq.(9), we consider the following parametric model:

$$\mathcal{G} = \{ g(x; \theta) \mid \theta \in \Theta \subset \mathbb{R}^b \},$$

where b is a finite number. Here we assume that the above parametric model is correctly
specified, i.e., it includes the true relative density-ratio function $r_\alpha(x)$: there exists $\theta^*$
such that

$$g(x; \theta^*) = r_\alpha(x).$$

Here, we use RuLSIF without regularization, i.e., $\lambda = 0$ in Eq.(9).
[Figure 2 (plots not reproduced). Panels: (a) P' = N(0, 1): P and P' are the same; (b) P' = N(0, 0.6): P' has smaller standard deviation than P; (c) P' = N(0, 2): P' has larger standard deviation than P; (d) P' = N(0.5, 1): P and P' have different means; (e) P' = 0.95 N(0, 1) + 0.05 N(3, 1): P' contains an additional component to P.]

Figure 2: Illustrative examples of divergence estimation by RuLSIF. From left to right:
true density-ratios for α = 0, 0.5, and 0.95 (P = N(0, 1)), and estimation error of PE
divergence for α = 0, 0.5, and 0.95.
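Let us denote the variance of $\widehat{\mathrm{PE}}_\alpha$ (7) by $\mathbb{V}[\widehat{\mathrm{PE}}_\alpha]$, where randomness comes from the
draw of samples $\{x_i\}_{i=1}^{n}$ and $\{x'_j\}_{j=1}^{n'}$. Then, under a standard regularity condition for
the asymptotic normality (see Section 3 of van der Vaart, 2000), $\mathbb{V}[\widehat{\mathrm{PE}}_\alpha]$ can be expressed
and upper-bounded as

$$\mathbb{V}[\widehat{\mathrm{PE}}_\alpha] = \frac{1}{n} \mathbb{V}_{p(x)} \left[ r_\alpha - \frac{\alpha r_\alpha(x)^2}{2} \right] + \frac{1}{n'} \mathbb{V}_{p'(x)} \left[ \frac{(1 - \alpha) r_\alpha(x)^2}{2} \right] + o \left( \frac{1}{n}, \frac{1}{n'} \right) \tag{13}$$

$$\le \frac{\| r_\alpha \|_\infty^2}{n} + \frac{\alpha^2 \| r_\alpha \|_\infty^4}{4n} + \frac{(1 - \alpha)^2 \| r_\alpha \|_\infty^4}{4n'} + o \left( \frac{1}{n}, \frac{1}{n'} \right). \tag{14}$$

Let us denote the variance of $\widetilde{\mathrm{PE}}_\alpha$ by $\mathbb{V}[\widetilde{\mathrm{PE}}_\alpha]$. Then, under a standard regularity
condition for the asymptotic normality (see Section 3 of van der Vaart, 2000), the variance
of $\widetilde{\mathrm{PE}}_\alpha$ is asymptotically expressed as

$$\mathbb{V}[\widetilde{\mathrm{PE}}_\alpha] = \frac{1}{n} \mathbb{V}_{p(x)} \left[ \frac{r_\alpha + (1 - \alpha r_\alpha) \mathbb{E}_{p(x)}[\nabla g]^\top U_\alpha^{-1} \nabla g}{2} \right]
+ \frac{1}{n'} \mathbb{V}_{p'(x)} \left[ \frac{(1 - \alpha) r_\alpha \mathbb{E}_{p(x)}[\nabla g]^\top U_\alpha^{-1} \nabla g}{2} \right] + o \left( \frac{1}{n}, \frac{1}{n'} \right), \tag{15}$$

where $\nabla g$ is the gradient vector of g with respect to $\theta$ at $\theta = \theta^*$, i.e.,

$$(\nabla g(x; \theta^*))_j = \frac{\partial g(x; \theta^*)}{\partial \theta_j}.$$

The matrix $U_\alpha$ is defined by

$$U_\alpha = \alpha \mathbb{E}_{p(x)}[\nabla g \nabla g^\top] + (1 - \alpha) \mathbb{E}_{p'(x)}[\nabla g \nabla g^\top].$$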

3.2.2  Interpretation 

Eq.(13) shows that, up to $O(\frac{1}{n}, \frac{1}{n'})$, the variance of $\widehat{\mathrm{PE}}_\alpha$ depends only on the true relative
density-ratio $r_\alpha(x)$, not on the estimator of $r_\alpha(x)$. This means that the model complexity
does not affect the asymptotic variance. Therefore, overfitting would hardly occur in the
estimation of the relative PE divergence even when complex models are used. We note
that the above superior property is applicable only to relative PE divergence estimation,
not to relative density-ratio estimation. This implies that overfitting occurs in relative
density-ratio estimation, but the approximation error cancels out in relative PE divergence
estimation.

On the other hand, Eq.(15) shows that the variance of $\widetilde{\mathrm{PE}}_\alpha$ is affected by the model
$\mathcal{G}$, since the factor $\mathbb{E}_{p(x)}[\nabla g]^\top U_\alpha^{-1} \nabla g$ depends on the model complexity in general. When
the equality

$$\mathbb{E}_{p(x)}[\nabla g]^\top U_\alpha^{-1} \nabla g(x; \theta^*) = r_\alpha(x)$$

holds, the variances of $\widetilde{\mathrm{PE}}_\alpha$ and $\widehat{\mathrm{PE}}_\alpha$ are asymptotically the same. However, in general,
the use of $\widehat{\mathrm{PE}}_\alpha$ would be more recommended.

Eq.(14) shows that the variance $\mathbb{V}[\widehat{\mathrm{PE}}_\alpha]$ can be upper-bounded by the quantity depending
on $\| r_\alpha \|_\infty$, which is monotonically lowered if $\| r_\alpha \|_\infty$ is reduced. Since $\| r_\alpha \|_\infty$
monotonically decreases as $\alpha$ increases, our proposed approach of estimating the $\alpha$-relative
PE divergence (with $\alpha > 0$) would be more advantageous than the naive approach of estimating
the plain PE divergence (which corresponds to $\alpha = 0$) in terms of the parametric
asymptotic variance.

3.2.3  Numerical  Illustration 

Here, we show some numerical results for illustrating the above theoretical results using
the one-dimensional datasets (b) and (c) in Section 2.4. Let us define the parametric
model as

$$\mathcal{G}_k = \left\{ g(x; \theta) = \frac{r(x; \theta)}{\alpha r(x; \theta) + 1 - \alpha} \;\middle|\; r(x; \theta) = \exp \left( \sum_{\ell=0}^{k} \theta_\ell x^\ell \right),\ \theta \in \mathbb{R}^{k+1} \right\}. \tag{16}$$

The dimension of the model $\mathcal{G}_k$ is equal to k + 1. The $\alpha$-relative density-ratio $r_\alpha(x)$ can
be expressed using the ordinary density-ratio r(x) = p(x)/p'(x) as

$$r_\alpha(x) = \frac{r(x)}{\alpha r(x) + 1 - \alpha}.$$

Thus, when k > 1, the above model $\mathcal{G}_k$ includes the true relative density-ratio $r_\alpha(x)$ of the
datasets (b) and (c). We test RuLSIF with $\alpha = 0.2$ and 0.8 for the model (16) with degree
k = 1, 2, ..., 8. The parameter $\theta$ is learned so that Eq.(9) is minimized by a quasi-Newton
method.
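
To see why degree k > 1 suffices for these two datasets, the ordinary density-ratio can be written out explicitly. The following is a short worked calculation, assuming $P = N(0, 1)$ and $P' = N(0, v)$ with $v$ denoting the variance of P' (the symbol $v$ is introduced here only for illustration):

```latex
r(x) = \frac{p(x)}{p'(x)}
     = \frac{(2\pi)^{-1/2} \exp(-x^2/2)}{(2\pi v)^{-1/2} \exp(-x^2/(2v))}
     = \exp\!\left( \tfrac{1}{2}\log v
                    + \left( \tfrac{1}{2v} - \tfrac{1}{2} \right) x^2 \right),
% i.e. r(x; \theta) with \theta_0 = \tfrac{1}{2}\log v, \theta_1 = 0,
% \theta_2 = \tfrac{1}{2v} - \tfrac{1}{2}, and \theta_\ell = 0 for \ell > 2.
```

The exponent is a quadratic polynomial in x, so it is representable by the model (16) exactly when k is at least 2, while k = 1 cannot represent it.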

The standard deviations of $\widehat{\mathrm{PE}}_\alpha$ and $\widetilde{\mathrm{PE}}_\alpha$ for the datasets (b) and (c) are depicted
in Figure 3 and Figure 4, respectively. The graphs show that the degree of models does
not significantly affect the standard deviation of $\widehat{\mathrm{PE}}_\alpha$ (i.e., no overfitting), as long as the
model includes the true relative density-ratio (i.e., k > 1). On the other hand, bigger
models tend to produce larger standard deviations in $\widetilde{\mathrm{PE}}_\alpha$. Thus, the standard deviation
of $\widetilde{\mathrm{PE}}_\alpha$ more strongly depends on the model complexity.


4  Experiments 

In  this  section,  we  experimentally  evaluate  the  performance  of  the  proposed  method  in 
two-sample  homogeneity  test,  outlier  detection,  and  transfer  learning  tasks. 

4.1  Two-Sample  Homogeneity  Test 

First,  we  apply  the  proposed  divergence  estimator  to  two-sample  homogeneity  test. 
[Figures 3 and 4 (plots not reproduced). Each figure contains panels for the PE estimators with α = 0.2 and α = 0.8; the horizontal axis is the number of samples (n = n').]

Figure 3: Standard deviations of PE estimators for dataset (b) (i.e., P = N(0, 1) and
P' = N(0, 0.6)) as functions of the sample size n = n'.

Figure 4: Standard deviations of PE estimators for dataset (c) (i.e., P = N(0, 1) and
P' = N(0, 2)) as functions of the sample size n = n'.
4.1.1  Divergence-Based Two-Sample Homogeneity Test

Given two sets of samples $\mathcal{X} = \{x_i\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} P$ and $\mathcal{X}' = \{x'_j\}_{j=1}^{n'} \overset{\text{i.i.d.}}{\sim} P'$, the goal of the
two-sample homogeneity test is to test the null hypothesis that the probability distributions
P and P' are the same against its complementary alternative (i.e., the distributions
are different).

By using an estimator $\widehat{\mathrm{Div}}$ of some divergence between the two distributions P and P',
homogeneity of two distributions can be tested based on the permutation test procedure
(Efron and Tibshirani, 1993) as follows:

• Obtain a divergence estimate $\widehat{\mathrm{Div}}$ using the original datasets $\mathcal{X}$ and $\mathcal{X}'$.

• Randomly permute the $|\mathcal{X} \cup \mathcal{X}'|$ samples, and assign the first $|\mathcal{X}|$ samples to a set
$\widetilde{\mathcal{X}}$ and the remaining $|\mathcal{X}'|$ samples to another set $\widetilde{\mathcal{X}}'$.

• Obtain a divergence estimate $\widetilde{\mathrm{Div}}$ using the randomly shuffled datasets $\widetilde{\mathcal{X}}$ and $\widetilde{\mathcal{X}}'$
(note that, since $\widetilde{\mathcal{X}}$ and $\widetilde{\mathcal{X}}'$ can be regarded as being drawn from the same distribution,
$\widetilde{\mathrm{Div}}$ tends to be close to zero).

• Repeat this random shuffling procedure many times, and construct the empirical
distribution of $\widetilde{\mathrm{Div}}$ under the null hypothesis that the two distributions are the
same.

• Approximate the p-value by evaluating the relative ranking of the original $\widehat{\mathrm{Div}}$ in
the distribution of $\widetilde{\mathrm{Div}}$.
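
The permutation test procedure above can be sketched as follows. This is an illustrative outline under stated assumptions: the function names are hypothetical, the divergence estimator is passed in as a callable, and a squared difference of sample means is used as a toy stand-in for a real divergence estimator such as the RuLSIF-based one.

```python
import numpy as np


def permutation_two_sample_p_value(x, x_prime, divergence, n_perm=200, seed=None):
    """Approximate the p-value of the two-sample homogeneity test."""
    rng = np.random.default_rng(seed)
    observed = divergence(x, x_prime)  # estimate on the original datasets
    pooled = np.concatenate([x, x_prime])
    n = len(x)
    null_values = np.empty(n_perm)
    for b in range(n_perm):
        # Shuffle the pooled samples and split them into sets of the original sizes.
        shuffled = pooled[rng.permutation(len(pooled))]
        null_values[b] = divergence(shuffled[:n], shuffled[n:])
    # Relative ranking of the original estimate among the shuffled estimates.
    return float(np.mean(null_values >= observed))


def squared_mean_difference(a, b):
    """Toy stand-in for a divergence estimator (assumption; not PE divergence)."""
    return float((a.mean() - b.mean()) ** 2)
```

A small p-value indicates that the original divergence estimate is unusually large compared with estimates obtained when both sets come from the same pooled distribution.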

When an asymmetric divergence such as the KL divergence (Kullback and Leibler,
1951) or the PE divergence (Pearson, 1900) is adopted for the two-sample homogeneity test,
the test results depend on the choice of direction: a divergence from $P$ to $P'$ or from $P'$
to $P$. Sugiyama et al. (2011) proposed choosing the direction that gives the smaller
p-value. It was experimentally shown that, when the uLSIF-based PE divergence estimator
is used for the two-sample homogeneity test (which is called the least-squares two-sample
homogeneity test; LSTT), the heuristic of choosing the direction with the smaller p-value
contributes to reducing the type-II error (the probability of accepting an incorrect null
hypothesis, i.e., two distributions are judged to be the same when they are actually
different), while the increase of the type-I error (the probability of rejecting a correct null
hypothesis, i.e., two distributions are judged to be different when they are actually the
same) is kept moderate.

Below, we refer to LSTT with $p(x)/p'(x)$ as the plain LSTT, LSTT with $p'(x)/p(x)$
as the reciprocal LSTT, and LSTT with heuristically choosing the one with a smaller
p-value as the adaptive LSTT.
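
As a rough worked bound on how much the adaptive rule could inflate the type-I error, suppose (an assumption made here for illustration, not a result from the experiments) that the plain and reciprocal p-values, written $p_{\mathrm{plain}}$ and $p_{\mathrm{recip}}$, are each valid under the null hypothesis, i.e., $\Pr(p \le \delta) \le \delta$. Rejecting whenever the smaller of the two falls below $\delta = 0.05$ then gives:

```latex
\Pr\bigl(\min(p_{\mathrm{plain}}, p_{\mathrm{recip}}) \le 0.05\bigr)
  \;\le\; \Pr(p_{\mathrm{plain}} \le 0.05) + \Pr(p_{\mathrm{recip}} \le 0.05)
  \;\le\; 0.10,
\qquad
\text{and, if the two p-values were independent and uniform,}\quad
1 - (1 - 0.05)^2 = 0.0975 .
```

Both p-values are computed from the same samples and are therefore generally dependent, so the actual inflation may be smaller than these figures; the union bound only caps the worst case at twice the nominal level.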

4.1.2  Artificial  Datasets 

We illustrate how the proposed method behaves in two-sample homogeneity test scenarios
using the artificial datasets (a)-(d) described in Section 2.4. We test the plain LSTT,
reciprocal LSTT, and adaptive LSTT for $\alpha = 0$, $0.5$, and $0.95$, with significance level 5%.

The experimental results are shown in Figure 5. For the dataset (a) where $P = P'$
(i.e., the null hypothesis is correct), the plain LSTT and reciprocal LSTT correctly accept
the null hypothesis with probability approximately 95%. This means that the type-I error
is properly controlled in these methods. On the other hand, the adaptive LSTT tends to
give slightly lower acceptance rates than 95% for this toy dataset, but the adaptive LSTT
with $\alpha = 0.5$ still works reasonably well. This implies that the heuristic of choosing the
method with a smaller p-value does not have critical influence on the type-I error.

In the datasets (b), (c), and (d), $P$ is different from $P'$ (i.e., the null hypothesis is not
correct), and thus we want to reduce the acceptance rate of the incorrect null hypothesis
as much as possible. In the plain setup for the dataset (b) and the reciprocal setup for the
dataset (c), the true density-ratio functions with $\alpha = 0$ diverge to infinity, and thus larger
$\alpha$ makes the density-ratio approximation more reliable. However, $\alpha = 0.95$ does not work
well because it produces an overly smoothed density-ratio function that is hard to
distinguish from the completely constant density-ratio function (which corresponds
to $P = P'$). On the other hand, in the reciprocal setup for the dataset (b) and the plain
setup for the dataset (c), small $\alpha$ performs poorly since density-ratio functions with large
$\alpha$ can be more accurately approximated than those with small $\alpha$ (see Figure 1). In the
adaptive setup, large $\alpha$ tends to perform slightly better than small $\alpha$ for the datasets (b)
and (c).

In the dataset (d), the true density-ratio function with $\alpha = 0$ diverges to infinity for
both the plain and reciprocal setups. In this case, an intermediate $\alpha$ performs the best, since
it balances the trade-off between high distinguishability from the completely constant
density-ratio function (which corresponds to $P = P'$) and easy approximability. The same
tendency that an intermediate $\alpha$ works well can also be mildly observed in the adaptive LSTT for
the dataset (d).

Overall, if the plain LSTT (or the reciprocal LSTT) is used, small $\alpha$ (or large $\alpha$)
sometimes works excellently. However, it performs poorly in other cases and thus the
performance is unstable depending on the true distributions. The plain LSTT (or the
reciprocal LSTT) with an intermediate $\alpha$ tends to perform reasonably well for all datasets. On
the other hand, the adaptive LSTT was shown to nicely overcome the above instability
problem when $\alpha$ is small or large. However, when $\alpha$ is set to an intermediate value, the plain
LSTT and the reciprocal LSTT both give similar results and thus the adaptive LSTT
provides only a small amount of improvement.

Our empirical finding is that, if we have prior knowledge that one distribution has a
wider support than the other distribution, assigning the distribution with the wider support
to $P'$ and setting $\alpha$ to a large value seems to work well. If there is no knowledge of
the true distributions, or the two distributions have less overlapped supports, using an intermediate $\alpha$
in the adaptive setup seems to be a reasonable choice.

We  will  systematically  investigate  this  issue  using  more  complex  datasets  below. 
[Figure 5 panels: (b) $P' = N(0, 0.6)$: $P'$ has smaller standard deviation than $P$. (c) $P' = N(0, 2)$: $P'$ has larger standard deviation than $P$. (d) $P' = N(0.5, 1)$: $P$ and $P'$ have different means. Horizontal axis: number of samples ($n = n'$).]

Figure 5: Illustrative examples of two-sample homogeneity test based on relative
divergence estimation. From left to right: true densities ($P = N(0, 1)$), the acceptance rate of
the null hypothesis under the significance level 5% by plain LSTT, reciprocal LSTT, and
adaptive LSTT.
4.1.3 Benchmark Datasets

Here, we apply the proposed two-sample homogeneity test to the binary classification
datasets taken from the IDA repository (Rätsch et al., 2001).

We test the adaptive LSTT with the RuLSIF-based PE divergence estimator for $\alpha = 0$,
$0.5$, and $0.95$; we also test the maximum mean discrepancy (MMD; Borgwardt et al., 2006),
which is a kernel-based two-sample homogeneity test method. The performance of MMD
depends on the choice of the Gaussian kernel width. Here, we adopt the version proposed by
Sriperumbudur et al. (2009), which automatically optimizes the Gaussian kernel width.
The p-values of MMD are computed in the same way as for LSTT, based on the permutation
test procedure.

First, we investigate the rate of accepting the null hypothesis when the null hypothesis
is correct (i.e., the two distributions are the same). We split all the positive training
samples into two sets and perform the two-sample homogeneity test for the two sets of samples.
The experimental results are summarized in Table 1, showing that the adaptive LSTT
with $\alpha = 0.5$ compares favorably with those with $\alpha = 0$ and $0.95$ and with MMD in terms of the
type-I error.

Next, we consider the situation where the null hypothesis is not correct (i.e., the two
distributions are different). The numerator samples are generated in the same way as
above, but half of the denominator samples are replaced with negative training samples.
Thus, while the numerator sample set contains only positive training samples, the
denominator sample set includes both positive and negative training samples. The experimental
results are summarized in Table 2, showing that the adaptive LSTT with $\alpha = 0.5$ again
compares favorably with those with $\alpha = 0$ and $0.95$. Furthermore, LSTT with $\alpha = 0.5$ tends
to outperform MMD in terms of the type-II error.

Overall, LSTT with $\alpha = 0.5$ is shown to be a useful method for the two-sample
homogeneity test.

4.2  Inlier-Based  Outlier  Detection 

Next,  we  apply  the  proposed  method  to  outlier  detection. 

4.2.1  Density-Ratio  Approach  to  Inlier-Based  Outlier  Detection 

Let us consider an outlier detection problem of finding irregular samples in a dataset
(called an “evaluation dataset”) based on another dataset (called a “model dataset”) that
only contains regular samples. Defining the density ratio over the two sets of samples, we
can see that the density-ratio values for regular samples are close to one, while those for
outliers tend to deviate significantly from one. Thus, density-ratio values could be
used as an index of the degree of outlyingness (Smola et al., 2009; Hido et al., 2011).

Since  the  evaluation  dataset  usually  has  a  wider  support  than  the  model  dataset, 
we  regard  the  evaluation  dataset  as  samples  corresponding  to  the  denominator  density 
p'(x),  and  the  model  dataset  as  samples  corresponding  to  the  numerator  density  p(x). 
Table 1: Experimental results of two-sample homogeneity test for the IDA datasets. The
mean (and standard deviation in the bracket) rate of accepting the null hypothesis (i.e.,
$P = P'$) under the significance level 5% is reported. The two sets of samples are both
taken from the positive training set (i.e., the null hypothesis is correct). Methods having
the mean acceptance rate 0.95 according to the one-sample t-test at the significance level
5% are specified by bold face.

Datasets        d    n = n'   MMD           LSTT               LSTT               LSTT
                                            ($\alpha = 0.0$)   ($\alpha = 0.5$)   ($\alpha = 0.95$)
banana          2    100      0.96 (0.20)   0.93 (0.26)        0.92 (0.27)        0.92 (0.27)
thyroid         5    19       0.96 (0.20)   0.95 (0.22)        0.95 (0.22)        0.88 (0.33)
titanic         5    21       0.94 (0.24)   0.86 (0.35)        0.92 (0.27)        0.89 (0.31)
diabetes        8    85       0.96 (0.20)   0.87 (0.34)        0.91 (0.29)        0.82 (0.39)
breast-cancer   9    29       0.98 (0.14)   0.91 (0.29)        0.94 (0.24)        0.92 (0.27)
flare-solar     9    100      0.93 (0.26)   0.91 (0.29)        0.95 (0.22)        0.93 (0.26)
heart           13   38       1.00 (0.00)   0.85 (0.36)        0.91 (0.29)        0.93 (0.26)
german          20   100      0.99 (0.10)   0.91 (0.29)        0.92 (0.27)        0.89 (0.31)
ringnorm        20   100      0.97 (0.17)   0.93 (0.26)        0.91 (0.29)        0.85 (0.36)
waveform        21   66       0.98 (0.14)   0.92 (0.27)        0.93 (0.26)        0.88 (0.33)

Table 2: Experimental results of two-sample homogeneity test for the IDA datasets. The
mean (and standard deviation in the bracket) rate of accepting the null hypothesis (i.e.,
$P = P'$) under the significance level 5% is reported. The set of samples corresponding to
the numerator of the density ratio is taken from the positive training set and the set of
samples corresponding to the denominator of the density ratio is taken from the positive
training set and the negative training set (i.e., the null hypothesis is not correct). The
best method having the lowest mean acceptance rate and comparable methods according
to the two-sample t-test at the significance level 5% are specified by bold face.

Datasets        d    n = n'   MMD           LSTT               LSTT               LSTT
                                            ($\alpha = 0.0$)   ($\alpha = 0.5$)   ($\alpha = 0.95$)
banana          2    100      0.52 (0.50)   0.10 (0.30)        0.02 (0.14)        0.17 (0.38)
thyroid         5    19       0.52 (0.50)   0.81 (0.39)        0.65 (0.48)        0.80 (0.40)
titanic         5    21       0.87 (0.34)   0.86 (0.35)        0.87 (0.34)        0.88 (0.33)
diabetes        8    85       0.31 (0.46)   0.42 (0.50)        0.47 (0.50)        0.57 (0.50)
breast-cancer   9    29       0.87 (0.34)   0.75 (0.44)        0.80 (0.40)        0.79 (0.41)
flare-solar     9    100      0.51 (0.50)   0.81 (0.39)        0.55 (0.50)        0.66 (0.48)
heart           13   38       0.53 (0.50)   0.28 (0.45)        0.40 (0.49)        0.62 (0.49)
german          20   100      0.56 (0.50)   0.55 (0.50)        0.44 (0.50)        0.68 (0.47)
ringnorm        20   100      0.00 (0.00)   0.00 (0.00)        0.00 (0.00)        0.02 (0.14)
waveform        21   66       0.00 (0.00)   0.00 (0.00)        0.02 (0.14)        0.00 (0.00)
Table 3: Mean AUC score (and the standard deviation in the bracket) over 1000 trials for
the artificial outlier-detection dataset. The best method in terms of the mean AUC score
and comparable methods according to the two-sample t-test at the significance level 5%
are specified by bold face.

Input dimensionality d    RuLSIF             RuLSIF             RuLSIF
                          ($\alpha = 0$)     ($\alpha = 0.5$)   ($\alpha = 0.95$)
1                         .933 (.089)        .926 (.100)        .896 (.124)
5                         .882 (.099)        .891 (.091)        .894 (.086)
10                        .842 (.107)        .850 (.103)        .859 (.092)

Then, outliers tend to have smaller density-ratio values (i.e., close to zero). As such,
density-ratio approximators can be used for outlier detection.

When evaluating the performance of outlier detection methods, it is important to take
into account both the detection rate (i.e., the amount of true outliers an outlier detection
algorithm can find) and the detection accuracy (i.e., the amount of true inliers an outlier
detection algorithm misjudges as outliers). Since there is a trade-off between the detection
rate and the detection accuracy, we adopt the area under the ROC curve (AUC) as our
performance metric (Bradley, 1997).
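
To make the metric concrete, the AUC can be computed directly from scores as the probability that a randomly chosen outlier receives a higher outlier score than a randomly chosen inlier. The sketch below is an illustrative implementation under assumptions made here: the function name `auc_from_scores` is hypothetical, ties are counted as one half, and the negated density-ratio value is used as the outlier score (since outliers are expected to have small ratios).

```python
import numpy as np


def auc_from_scores(outlier_scores, is_outlier):
    """Area under the ROC curve via pairwise comparisons.

    Returns the fraction of (outlier, inlier) pairs in which the outlier
    has the higher score; ties contribute one half.
    """
    scores = np.asarray(outlier_scores, dtype=float)
    labels = np.asarray(is_outlier, dtype=bool)
    pos = scores[labels]       # scores of true outliers
    neg = scores[~labels]      # scores of true inliers
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)


# Hypothetical density-ratio values: small values indicate outliers,
# so the negated ratio serves as the outlier score.
ratios = np.array([0.9, 1.0, 0.2, 1.1, 0.1])
labels = np.array([0, 0, 1, 0, 1])
auc = auc_from_scores(-ratios, labels)
```

The pairwise form costs a number of comparisons proportional to the product of the two group sizes, which is unproblematic at the sample sizes used in these experiments.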

4.2.2  Artificial  Datasets 

First,  we  illustrate  how  the  proposed  method  behaves  in  outlier  detection  scenarios  using 
artificial  datasets. 

Let

$$P = N(0, I_d),$$

$$P' = 0.95\,N(0, I_d) + 0.05\,N(3d^{-1/2}\mathbf{1}_d, I_d),$$

where $d$ is the dimensionality of $x$ and $\mathbf{1}_d$ is the $d$-dimensional vector with all ones. Note
that this setup is the same as the dataset (e) described in Section 2.4 when $d = 1$. Here,
the samples drawn from $N(0, I_d)$ are regarded as inliers, while the samples drawn from
$N(3d^{-1/2}\mathbf{1}_d, I_d)$ are regarded as outliers. We use $n = n' = 100$ samples.

Table 3 describes the AUC values for input dimensionality $d = 1$, $5$, and $10$ for RuLSIF
with $\alpha = 0$, $0.5$, and $0.95$. This shows that, as the input dimensionality $d$ increases, the
AUC values overall get smaller. Thus, outlier detection becomes more challenging in
high-dimensional cases.

The result also shows that RuLSIF with small $\alpha$ tends to work well when the input
dimensionality is low, and RuLSIF with large $\alpha$ works better as the input dimensionality
increases. This tendency can be interpreted as follows: If $\alpha$ is small, the density-ratio
function tends to have a sharp ‘hollow’ at outlier points (see the leftmost graph in
Figure 2(e)). Thus, as long as the true density-ratio function can be accurately estimated,
small $\alpha$ would be preferable in outlier detection. When the data dimensionality is low,
density-ratio approximation is rather easy and thus small $\alpha$ tends to perform well.
However, as the data dimensionality increases, density-ratio approximation gets harder, and
thus large $\alpha$, which produces a smoother density-ratio function, is more favorable since
such a smoother function can be more easily approximated than a ‘bumpy’ one produced
by small $\alpha$.
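
The smoothing effect of $\alpha$ can be checked on the artificial setup by evaluating the true relative density ratio in closed form. The sketch below assumes the $\alpha$-relative density ratio has the form $r_\alpha(x) = p(x) / (\alpha p(x) + (1 - \alpha) p'(x))$; the helper names `log_gaussian` and `true_relative_ratio` are hypothetical, and the code uses the known densities rather than a RuLSIF estimate.

```python
import numpy as np


def log_gaussian(x, mean):
    """Log-density of N(mean, I_d) evaluated at each row of x."""
    d = x.shape[1]
    diff = x - mean
    return -0.5 * np.sum(diff ** 2, axis=1) - 0.5 * d * np.log(2.0 * np.pi)


def true_relative_ratio(x, alpha, shift=3.0, outlier_rate=0.05):
    """r_alpha(x) = p(x) / (alpha p(x) + (1 - alpha) p'(x)) with
    p = N(0, I_d) and p' = 0.95 N(0, I_d) + 0.05 N(shift d^{-1/2} 1_d, I_d)."""
    d = x.shape[1]
    outlier_mean = shift / np.sqrt(d) * np.ones(d)
    p = np.exp(log_gaussian(x, np.zeros(d)))
    p_outlier = np.exp(log_gaussian(x, outlier_mean))
    p_prime = (1.0 - outlier_rate) * p + outlier_rate * p_outlier
    return p / (alpha * p + (1.0 - alpha) * p_prime)


# Evaluate at the inlier mode (origin) and at the outlier mode for d = 1.
d = 1
points = np.vstack([np.zeros(d), 3.0 / np.sqrt(d) * np.ones(d)])
for alpha in (0.0, 0.5, 0.95):
    print(alpha, true_relative_ratio(points, alpha))
```

At the outlier mode the ratio moves toward one as $\alpha$ grows, which is the flattening of the ‘hollow’ described above: the function becomes easier to approximate but separates outliers from inliers less sharply.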

4.2.3  Real-World  Datasets 

Next,  we  evaluate  the  proposed  outlier  detection  method  using  various  real-world  datasets: 

IDA repository: The IDA repository (Rätsch et al., 2001) contains various binary
classification tasks. Each dataset consists of positive/negative and training/test samples.
We use positive training samples as inliers in the “model” set. In the “evaluation”
set, we use at most 100 positive test samples as inliers and the first 5% of negative
test samples as outliers. Thus, the positive samples are treated as inliers and the
negative samples are treated as outliers.

Speech dataset: An in-house speech dataset, which contains short utterance samples
recorded from 2 male subjects speaking in French with a sampling rate of 44.1 kHz. From
each utterance sample, we extracted a 50-dimensional line spectral frequencies vector
(Kain and Macon, 1998). We randomly take 200 samples from one class and assign
them to the model dataset. Then we randomly take 200 samples from the same
class and 10 samples from the other class for the evaluation dataset.

20 Newsgroup dataset: The 20-Newsgroups dataset¹ contains 20000 newsgroup
documents, which fall into the following 4 top-level categories: ‘comp’, ‘rec’, ‘sci’, and
‘talk’. Each document is expressed by a 100-dimensional bag-of-words vector of
term frequencies. We randomly take 200 samples from the ‘comp’ class and assign
them to the model dataset. Then we randomly take 200 samples from the same
class and 10 samples from one of the other classes for the evaluation dataset.

The USPS hand-written digit dataset: The USPS hand-written digit dataset²
contains 9298 digit images. Each image consists of 256 (= 16 × 16) pixels and each pixel
takes an integer value between 0 and 255 as the intensity level. We regard samples
in one class as inliers and samples in other classes as outliers. We randomly take
200 samples from the inlier class and assign them to the model dataset. Then we
randomly take 200 samples from the same inlier class and 10 samples from one of
the other classes for the evaluation dataset.

We compare the AUC scores of RuLSIF with $\alpha = 0$, $0.5$, and $0.95$, and one-class
support vector machine (OSVM) with the Gaussian kernel (Schölkopf et al., 2001). We
used the LIBSVM implementation of OSVM (Chang and Lin, 2001). The Gaussian
width is set to the median distance between samples, which has been shown to be a useful

¹ http://people.csail.mit.edu/jrennie/20Newsgroups/

² http://www.gaussianprocess.org/gpml/data/
Table 4: Experimental results of outlier detection for various real-world datasets.
Mean AUC score (and standard deviation in the bracket) over 100 trials is reported. The
best method having the highest mean AUC score and comparable methods according to
the two-sample t-test at the significance level 5% are specified by bold face. The datasets
are sorted in the ascending order of the input dimensionality $d$.

Datasets          d     OSVM             OSVM            RuLSIF           RuLSIF             RuLSIF
                        ($\nu = 0.05$)   ($\nu = 0.1$)   ($\alpha = 0$)   ($\alpha = 0.5$)   ($\alpha = 0.95$)
IDA:banana        2     .668 (.105)      .676 (.120)     .597 (.097)      .619 (.101)        .623 (.115)
IDA:thyroid       5     .760 (.148)      .782 (.165)     .804 (.148)      .796 (.178)        .722 (.153)
IDA:titanic       5     .757 (.205)      .752 (.191)     .750 (.182)      .701 (.184)        .712 (.185)
IDA:diabetes      8     .636 (.099)      .610 (.090)     .594 (.105)      .575 (.105)        .663 (.112)
IDA:b-cancer      9     .741 (.160)      .691 (.147)     .707 (.148)      .737 (.159)        .733 (.160)
IDA:f-solar       9     .594 (.087)      .590 (.083)     .626 (.102)      .612 (.100)        .584 (.114)
IDA:heart         13    .714 (.140)      .694 (.148)     .748 (.149)      .769 (.134)        .726 (.127)
IDA:german        20    .612 (.069)      .604 (.084)     .605 (.092)      .597 (.101)        .605 (.095)
IDA:ringnorm      20    .991 (.012)      .993 (.007)     .944 (.091)      .971 (.062)        .992 (.010)
IDA:waveform      21    .812 (.107)      .843 (.123)     .879 (.122)      .875 (.117)        .885 (.102)
Speech            50    .788 (.068)      .830 (.060)     .804 (.101)      .821 (.076)        .836 (.083)
20News (‘rec’)    100   .598 (.063)      .593 (.061)     .628 (.105)      .614 (.093)        .767 (.100)
20News (‘sci’)    100   .592 (.069)      .589 (.071)     .620 (.094)      .609 (.087)        .704 (.093)
20News (‘talk’)   100   .661 (.084)      .658 (.084)     .672 (.117)      .670 (.102)        .823 (.078)
USPS (1 vs. 2)    256   .889 (.052)      .926 (.037)     .848 (.081)      .878 (.088)        .898 (.051)
USPS (2 vs. 3)    256   .823 (.053)      .835 (.050)     .803 (.093)      .818 (.085)        .879 (.074)
USPS (3 vs. 4)    256   .901 (.044)      .939 (.031)     .950 (.056)      .961 (.041)        .984 (.016)
USPS (4 vs. 5)    256   .871 (.041)      .890 (.036)     .857 (.099)      .874 (.082)        .941 (.031)
USPS (5 vs. 6)    256   .825 (.058)      .859 (.052)     .863 (.078)      .867 (.068)        .901 (.049)
USPS (6 vs. 7)    256   .910 (.034)      .950 (.025)     .972 (.038)      .984 (.018)        .994 (.010)
USPS (7 vs. 8)    256   .938 (.030)      .967 (.021)     .941 (.053)      .951 (.039)        .980 (.015)
USPS (8 vs. 9)    256   .721 (.072)      .728 (.073)     .721 (.084)      .728 (.083)        .761 (.096)
USPS (9 vs. 0)    256   .920 (.037)      .966 (.023)     .982 (.048)      .989 (.022)        .994 (.011)

heuristic (Schölkopf et al., 2001). Since there is no systematic method to determine the
tuning parameter $\nu$ in OSVM, we report the results for $\nu = 0.05$ and $0.1$.

The mean and standard deviation of the AUC scores over 100 runs with random sample
choice are summarized in Table 4, showing that RuLSIF overall compares favorably with
OSVM. Among the RuLSIF methods, small $\alpha$ tends to perform well for low-dimensional
datasets, and large $\alpha$ tends to work well for high-dimensional datasets. This tendency
agrees well with that for the artificial datasets (see Section 4.2.2).
4.3 Transfer Learning

Finally,  we  apply  the  proposed  method  to  transfer  learning. 

4.3.1  Transductive  Transfer  Learning  by  Importance  Sampling 

Let us consider a problem of semi-supervised learning (Chapelle et al., 2006) from labeled
training samples $\{(x_j^{tr}, y_j^{tr})\}_{j=1}^{n_{tr}}$ and unlabeled test samples $\{x_i^{te}\}_{i=1}^{n_{te}}$. The goal is to predict
a test output value $y^{te}$ for a test input point $x^{te}$. Here, we consider the setup where
the labeled training samples $\{(x_j^{tr}, y_j^{tr})\}_{j=1}^{n_{tr}}$ are drawn i.i.d. from $p(y|x)p_{tr}(x)$, while the
unlabeled test samples $\{x_i^{te}\}_{i=1}^{n_{te}}$ are drawn i.i.d. from $p_{te}(x)$, which is generally different
from $p_{tr}(x)$; the (unknown) test sample $(x^{te}, y^{te})$ follows $p(y|x)p_{te}(x)$. This setup means
that the conditional probability $p(y|x)$ is common to training and test samples, but the
marginal densities $p_{tr}(x)$ and $p_{te}(x)$ are generally different for training and test input
points. Such a problem is called transductive transfer learning (Pan and Yang, 2010),
domain adaptation (Jiang and Zhai, 2007), or covariate shift (Shimodaira, 2000; Sugiyama
and Kawanabe, 2012).

Let $\mathrm{loss}(y, \widehat{y})$ be a point-wise loss function that measures a discrepancy between $y$ and
$\widehat{y}$ (at input $x$). Then the generalization error which we would like to ultimately minimize
is defined as

$$\mathbb{E}_{p(y|x)p_{te}(x)}\left[\mathrm{loss}(y, f(x))\right],$$


where $f(x)$ is a function model. Since the generalization error is inaccessible because the
true probability $p(y|x)p_{te}(x)$ is unknown, empirical-error minimization is often used in
practice (Vapnik, 1998):

$$\min_{f \in \mathcal{F}} \left[ \frac{1}{n_{tr}} \sum_{j=1}^{n_{tr}} \mathrm{loss}\left(y_j^{tr}, f(x_j^{tr})\right) \right].$$

However, under the covariate shift setup, plain empirical-error minimization is not
consistent (i.e., it does not converge to the optimal function) if the model $\mathcal{F}$ is misspecified
(i.e., the true function is not included in the model; see Shimodaira, 2000). Instead, the
following importance-weighted empirical-error minimization is consistent under covariate
shift:

$$\min_{f \in \mathcal{F}} \left[ \frac{1}{n_{tr}} \sum_{j=1}^{n_{tr}} r(x_j^{tr})\, \mathrm{loss}\left(y_j^{tr}, f(x_j^{tr})\right) \right],$$

where $r(x)$ is called the importance (Fishman, 1996) in the context of covariate shift
adaptation:

$$r(x) = \frac{p_{te}(x)}{p_{tr}(x)}.$$
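
The reason weighting by the importance $r(x) = p_{te}(x)/p_{tr}(x)$ restores consistency can be seen from a short change-of-measure argument. The derivation below is a sketch that assumes $p_{tr}(x) > 0$ wherever $p_{te}(x) > 0$ (a support condition stated here as an assumption):

```latex
\mathbb{E}_{p(y|x)p_{tr}(x)}\bigl[r(x)\,\mathrm{loss}(y, f(x))\bigr]
  = \iint \frac{p_{te}(x)}{p_{tr}(x)}\,\mathrm{loss}(y, f(x))\,p(y|x)\,p_{tr}(x)\,\mathrm{d}y\,\mathrm{d}x
  = \iint \mathrm{loss}(y, f(x))\,p(y|x)\,p_{te}(x)\,\mathrm{d}y\,\mathrm{d}x
  = \mathbb{E}_{p(y|x)p_{te}(x)}\bigl[\mathrm{loss}(y, f(x))\bigr].
```

The importance-weighted training average is therefore an unbiased estimate of the generalization error for any fixed $f$, whereas the unweighted average estimates the error under $p_{tr}(x)$ instead.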
However, since importance-weighted learning is not statistically efficient (i.e., it tends
to have larger variance), slightly flattening the importance weights is practically useful for
stabilizing the estimator. Shimodaira (2000) proposed to use the exponentially-flattened
importance weights as

$$\min_{f \in \mathcal{F}} \left[ \frac{1}{n_{tr}} \sum_{j=1}^{n_{tr}} r(x_j^{tr})^{\tau}\, \mathrm{loss}\left(y_j^{tr}, f(x_j^{tr})\right) \right],$$

where $0 \le \tau \le 1$ is called the exponential flattening parameter. $\tau = 0$ corresponds to plain
empirical-error minimization, while $\tau = 1$ corresponds to importance-weighted
empirical-error minimization; $0 < \tau < 1$ will give an intermediate estimator that balances the
trade-off between statistical efficiency and consistency. The exponential flattening parameter $\tau$
can be optimized by model selection criteria such as the importance-weighted Akaike
information criterion for regular models (Shimodaira, 2000), the importance-weighted subspace
information criterion for linear models (Sugiyama and Müller, 2005), and
importance-weighted cross-validation for arbitrary models (Sugiyama et al., 2007).

One of the potential drawbacks of the above exponential flattening approach is that
estimation of $r(x)$ (i.e., $\tau = 1$) is rather hard, as shown in this paper. Thus, when $r(x)$ is
estimated poorly, all flattened weights $r(x)^{\tau}$ are also unreliable and then covariate shift
adaptation does not work well in practice. To cope with this problem, we propose to use
relative importance weights instead:


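$$\min_{f \in \mathcal{F}} \left[ \frac{1}{n_{tr}} \sum_{j=1}^{n_{tr}} r_{\alpha}(x_j^{tr})\, \mathrm{loss}\left(y_j^{tr}, f(x_j^{tr})\right) \right],$$

where $r_{\alpha}(x)$ ($0 \le \alpha \le 1$) is the $\alpha$-relative importance weight defined by

$$r_{\alpha}(x) = \frac{p_{te}(x)}{(1 - \alpha)\,p_{te}(x) + \alpha\,p_{tr}(x)}.$$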

Note that, compared with the definition of the $\alpha$-relative density-ratio (1), $\alpha$ and $(1 - \alpha)$
are swapped in order to be consistent with exponential flattening. Indeed, the relative
importance weights play a similar role to exponentially-flattened importance weights;
$\alpha = 0$ corresponds to plain empirical-error minimization, while $\alpha = 1$ corresponds to
importance-weighted empirical-error minimization; $0 < \alpha < 1$ will give an intermediate
estimator that balances the trade-off between efficiency and consistency. We note that
the relative importance weights and exponentially-flattened importance weights agree only
when $\alpha = \tau = 0$ and $\alpha = \tau = 1$; for $0 < \alpha = \tau < 1$, they are generally different.

A possible advantage of the above relative importance weights is that their estimation for
$0 < \alpha < 1$ does not depend on that for $\alpha = 1$, unlike exponentially-flattened importance
weights. Since $\alpha$-relative importance weights for $0 < \alpha < 1$ can be reliably estimated by
RuLSIF proposed in this paper, the performance of covariate shift adaptation is expected
to be improved. Below, we experimentally investigate this effect.
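
The difference between the two flattening schemes can be seen numerically by evaluating both weights from known densities. The sketch below is illustrative only: it assumes Gaussian training and test densities $N(1, 0.25)$ and $N(2, 0.1)$ with the second argument read as a variance, uses the true densities rather than uLSIF or RuLSIF estimates, and the function names are hypothetical.

```python
import numpy as np


def gaussian_pdf(x, mean, var):
    """Density of the one-dimensional normal distribution N(mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def flattened_weight(p_te, p_tr, tau):
    """Exponentially-flattened importance weight (p_te / p_tr) ** tau."""
    return (p_te / p_tr) ** tau


def relative_weight(p_te, p_tr, alpha):
    """alpha-relative importance weight p_te / ((1 - alpha) p_te + alpha p_tr)."""
    return p_te / ((1.0 - alpha) * p_te + alpha * p_tr)


x = np.linspace(-1.0, 4.0, 501)
p_tr = gaussian_pdf(x, 1.0, 0.25)  # assumed: second argument is the variance
p_te = gaussian_pdf(x, 2.0, 0.1)

# Largest weight on the grid for each flattening level.
for level in (0.0, 0.5, 1.0):
    print(level,
          flattened_weight(p_te, p_tr, level).max(),
          relative_weight(p_te, p_tr, level).max())
```

For $\alpha < 1$ the relative weight is bounded above by $1/(1 - \alpha)$, because its denominator is at least $(1 - \alpha)\,p_{te}(x)$; the exponentially-flattened weight has no such bound whenever the plain importance is unbounded.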


4.3.2 Artificial Datasets

First,  we  illustrate  how  the  proposed  method  behaves  in  covariate  shift  adaptation  using 
one-dimensional  artificial  datasets. 

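In this experiment, we employ the following kernel regression model:

$$\widehat{f}(x; \beta) = \sum_{i=1}^{n_{te}} \beta_i \exp\left( -\frac{(x - x_i^{te})^2}{2\rho^2} \right),$$

where $\beta = (\beta_1, \ldots, \beta_{n_{te}})^{\top}$ is the parameter to be learned and $\rho$ is the Gaussian width.
The parameter $\beta$ is learned by relative importance-weighted least-squares (RIW-LS):

$$\widehat{\beta}_{\mathrm{RIW\text{-}LS}} = \mathop{\mathrm{argmin}}_{\beta} \left[ \frac{1}{n_{tr}} \sum_{j=1}^{n_{tr}} \widehat{r}_{\alpha}(x_j^{tr}) \left( \widehat{f}(x_j^{tr}; \beta) - y_j^{tr} \right)^2 \right],$$

or exponentially-flattened importance-weighted least-squares (EIW-LS):

$$\widehat{\beta}_{\mathrm{EIW\text{-}LS}} = \mathop{\mathrm{argmin}}_{\beta} \left[ \frac{1}{n_{tr}} \sum_{j=1}^{n_{tr}} \widehat{r}(x_j^{tr})^{\tau} \left( \widehat{f}(x_j^{tr}; \beta) - y_j^{tr} \right)^2 \right].$$

The relative importance weight $\widehat{r}_{\alpha}(x_j^{tr})$ is estimated by RuLSIF, and the
exponentially-flattened importance weight $\widehat{r}(x_j^{tr})^{\tau}$ is estimated by uLSIF (i.e., RuLSIF with $\alpha = 1$).
The Gaussian width $\rho$ is chosen by 5-fold importance-weighted cross-validation (Sugiyama
et al., 2007).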
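
A weighted least-squares fit of this kind has a closed-form solution, sketched below. Several choices are assumptions introduced for illustration and are not taken from the experiments: the small ridge term (added for numerical stability), the fixed Gaussian width (the text selects it by cross-validation), the use of true densities in place of RuLSIF estimates, reading the second argument of $N(\cdot, \cdot)$ as a variance, NumPy's normalized `np.sinc` as the target function, and all function names.

```python
import numpy as np


def gaussian_design(x, centers, rho):
    """Matrix with entries exp(-(x_j - c_i)^2 / (2 rho^2)) for 1-D inputs."""
    diff = x[:, None] - centers[None, :]
    return np.exp(-0.5 * diff ** 2 / rho ** 2)


def weighted_least_squares(x_tr, y_tr, weights, centers, rho, ridge=1e-3):
    """Minimise (1/n) sum_j w_j (f(x_j; beta) - y_j)^2 + ridge * ||beta||^2."""
    phi = gaussian_design(x_tr, centers, rho)
    n = len(x_tr)
    gram = phi.T @ (weights[:, None] * phi) / n + ridge * np.eye(len(centers))
    rhs = phi.T @ (weights * y_tr) / n
    return np.linalg.solve(gram, rhs)


def gaussian_pdf(x, mean, var):
    """Density of the one-dimensional normal distribution N(mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


rng = np.random.default_rng(0)
x_tr = rng.normal(1.0, np.sqrt(0.25), size=100)   # training inputs
x_te = rng.normal(2.0, np.sqrt(0.1), size=200)    # test inputs (kernel centres)
y_tr = np.sinc(x_tr) + rng.normal(0.0, np.sqrt(0.01), size=100)

# True alpha-relative importance weights at the training inputs
# (a stand-in for the RuLSIF estimate).
alpha, rho = 0.5, 0.3
p_tr = gaussian_pdf(x_tr, 1.0, 0.25)
p_te = gaussian_pdf(x_tr, 2.0, 0.1)
w = p_te / ((1.0 - alpha) * p_te + alpha * p_tr)

beta = weighted_least_squares(x_tr, y_tr, w, centers=x_te, rho=rho)
prediction = gaussian_design(x_te, x_te, rho) @ beta
```

Replacing `w` with the flattened weights $(p_{te}/p_{tr})^{\tau}$ gives the corresponding EIW-LS fit with the same solver.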

First, we consider the case where input distributions do not change:

$$P_{tr} = P_{te} = N(1, 0.25).$$

The densities and their ratios are plotted in Figure 6(a). The training output samples
$\{y_j^{tr}\}_{j=1}^{n_{tr}}$ are generated as

$$y_j^{tr} = \mathrm{sinc}(x_j^{tr}) + \epsilon_j^{tr},$$

where $\{\epsilon_j^{tr}\}_{j=1}^{n_{tr}}$ is additive noise following $N(0, 0.01)$. We set $n_{tr} = 100$ and $n_{te} = 200$.
Figure 6(b) shows a realization of training and test samples as well as learned functions
obtained by RIW-LS with $\alpha = 0.5$ and EIW-LS with $\tau = 0.5$. This shows that RIW-LS
with $\alpha = 0.5$ and EIW-LS with $\tau = 0.5$ give almost the same functions, and both
functions fit the true function well in the test region. Figure 6(c) shows the mean and
standard deviation of the test error under the squared loss over 200 runs, as functions of
the relative flattening parameter $\alpha$ in RIW-LS and the exponential flattening parameter
$\tau$ in EIW-LS. The method having a lower mean test error and another method that is
comparable according to the two-sample t-test at the significance level 5% are specified
by ‘o’. As can be observed, the proposed RIW-LS compares favorably with EIW-LS.
Figure 6: Illustrative example of transfer learning under no distribution change. Panels: (a) densities and ratios, (b) learned functions, (c) test error.

Figure 7: Illustrative example of transfer learning under covariate shift. Panels: (a) densities and ratios, (b) learned functions, (c) test error.


Next, we consider the situation where the input distribution changes (Figure 7(a)):

$$P_{tr} = N(1, 0.25),$$

$$P_{te} = N(2, 0.1).$$

The output values are created in the same way as in the previous case. Figure 7(b) shows a
realization of training and test samples as well as learned functions obtained by RIW-LS
with $\alpha = 0.5$ and EIW-LS with $\tau = 0.5$. This shows that RIW-LS with $\alpha = 0.5$ fits the
true function slightly better than EIW-LS with $\tau = 0.5$ in the test region. Figure 7(c)
shows that the proposed RIW-LS tends to outperform EIW-LS, and the standard
deviation of the test error for RIW-LS is much smaller than that for EIW-LS. This is because EIW-LS
with $0 < \tau < 1$ is based on an importance estimate with $\tau = 1$, which tends to have high
fluctuation. Overall, the stabilization effect of relative importance estimation was shown
to improve the test accuracy.

4.3.3  Real-World  Datasets 

Finally, we evaluate the proposed transfer learning method on a real-world transfer
learning task.
Figure 8: An example of three-axis accelerometer data for “walking” collected by iPod
touch.

We consider the problem of human activity recognition from accelerometer data
collected by iPod touch³. In the data collection procedure, subjects were asked to perform a
specific action such as walking, running, and bicycle riding. The duration of each task was
arbitrary and the sampling rate was 20 Hz with small variations. An example of three-axis
accelerometer data for “walking” is plotted in Figure 8.

To extract features from the accelerometer data, each data stream was segmented in
a sliding window manner with window width 5 seconds and sliding step 1 second. Depending
on the subject, the position and orientation of the iPod touch was arbitrary: held by
hand or kept in a pocket or a bag. For this reason, we decided to take the $\ell_2$-norm of
the 3-dimensional acceleration vector at each time step, and computed the following 5
orientation-invariant features from each window: mean, standard deviation, fluctuation of
amplitude, average energy, and frequency-domain entropy (Bao and Intille, 2004; Bharatula
et al., 2005).
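The feature extraction described above can be sketched as follows. This is an illustration only: the text does not define the five features precisely, so the code assumes that "fluctuation of amplitude" is the max-minus-min range, that "average energy" is the mean squared DFT magnitude with the DC component excluded, and that "frequency-domain entropy" is the entropy of the normalized DFT magnitudes. The function names and the nominal 20 Hz rate are likewise assumptions.

```python
import cmath
import math

def l2_norm_stream(samples):
    """Map (x, y, z) acceleration triples to orientation-invariant magnitudes."""
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]

def sliding_windows(signal, rate_hz=20, width_s=5, step_s=1):
    """Cut the stream into windows of width_s seconds, advancing by step_s seconds."""
    width, step = rate_hz * width_s, rate_hz * step_s
    return [signal[s:s + width] for s in range(0, len(signal) - width + 1, step)]

def dft_magnitudes(window):
    """Magnitudes of the non-DC DFT components (naive O(n^2) transform)."""
    n = len(window)
    return [abs(sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(window)))
            for k in range(1, n // 2 + 1)]

def window_features(window):
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in window) / n)
    fluctuation = max(window) - min(window)           # assumed definition
    mags = dft_magnitudes(window)
    energy = sum(m * m for m in mags) / len(mags)     # assumed definition
    total = sum(mags)
    probs = [m / total for m in mags] if total > 0 else []
    entropy = -sum(p * math.log(p) for p in probs if p > 0)  # assumed definition
    return [mean, std, fluctuation, energy, entropy]

def extract_features(samples):
    return [window_features(w) for w in sliding_windows(l2_norm_stream(samples))]
```

Because every feature is computed from the vector magnitude only, permuting or rotating the device axes leaves the features unchanged up to rounding, which is the property the text relies on.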

Let us consider a situation where a new user wants to use the activity recognition
system. However, since labeling accelerometer data is burdensome for the new user,
no labeled sample is available for the new user. On the other
hand, unlabeled samples for the new user and labeled data obtained from existing users
are available. Let the labeled training data $\{(x_j^{\mathrm{tr}}, y_j^{\mathrm{tr}})\}_{j=1}^{n_{\mathrm{tr}}}$ be the set of labeled accelerometer
data for 20 existing users. Each user has at most 100 labeled samples for each action.
Let the unlabeled test data $\{x_i^{\mathrm{te}}\}_{i=1}^{n_{\mathrm{te}}}$ be unlabeled accelerometer data obtained from the new
user.

We use kernel logistic regression (KLR) for activity recognition. We compare the
following four methods:

•  Plain KLR without importance weights (i.e., $\alpha = 0$ or $\tau = 0$).

•  KLR with relative importance weights for $\alpha = 0.5$.

•  KLR with exponentially-flattened importance weights for $\tau = 0.5$.

•  KLR with plain importance weights (i.e., $\alpha = 1$ or $\tau = 1$).

Table 5: Experimental results of transfer learning in real-world human activity recognition.
Mean classification accuracy (and the standard deviation in the bracket) over 100
runs for activity recognition of a new user is reported. The method having the highest
mean classification accuracy and comparable methods according to the two-sample t-test
at the significance level 5% are specified by bold face.

Task               KLR              RIW-KLR          EIW-KLR          IW-KLR
                   (α = 0, τ = 0)   (α = 0.5)        (τ = 0.5)        (α = 1, τ = 1)
Walk vs. run       0.803 (0.082)    0.889 (0.035)    0.882 (0.039)    0.882 (0.035)
Walk vs. bicycle   0.880 (0.025)    0.892 (0.035)    0.867 (0.054)    0.854 (0.070)
Walk vs. train     0.985 (0.017)    0.992 (0.008)    0.989 (0.011)    0.983 (0.021)

The experiments are repeated 100 times with different sample choices for $n_{\mathrm{tr}} = 500$ and
$n_{\mathrm{te}} = 200$. Table 5 reports the classification accuracy for three binary-classification tasks:
walk vs. run, walk vs. riding a bicycle, and walk vs. taking a train. The classification
accuracy is evaluated for 800 samples from the new user that are not used for classifier
training (i.e., the 800 test samples are different from the 200 unlabeled samples). The table
shows that KLR with relative importance weights for $\alpha = 0.5$ compares favorably with
the other methods in terms of the classification accuracy. KLR with plain importance weights
and KLR with exponentially-flattened importance weights for $\tau = 0.5$ are outperformed
by KLR without importance weights in the walk vs. riding a bicycle task due to the
instability of importance weight estimation for $\alpha = 1$ or $\tau = 1$.
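The comparison rule in the caption of Table 5 can be reproduced from the reported means and standard deviations. The sketch below is an illustration, not the authors' evaluation script: it assumes 100 independent runs per method, treats the reported deviations as sample standard deviations, and uses 1.97 as an approximate two-sided 5% critical value for about 198 degrees of freedom. The function names are hypothetical.

```python
import math

def two_sample_t(mean_a, std_a, mean_b, std_b, runs):
    """t statistic for two samples of equal size; pooled and Welch forms coincide here."""
    return (mean_a - mean_b) / math.sqrt((std_a ** 2 + std_b ** 2) / runs)

# Approximate two-sided 5% threshold for roughly 198 degrees of freedom (an assumption).
CRITICAL_5_PERCENT = 1.97

def comparable(mean_a, std_a, mean_b, std_b, runs=100):
    """True when the two mean accuracies are not significantly different at 5%."""
    return abs(two_sample_t(mean_a, std_a, mean_b, std_b, runs)) < CRITICAL_5_PERCENT

# Walk vs. run: RIW-KLR against EIW-KLR and against plain KLR (values from Table 5).
riw_vs_eiw = comparable(0.889, 0.035, 0.882, 0.039)
riw_vs_klr = comparable(0.889, 0.035, 0.803, 0.082)
```

Under these assumptions, RIW-KLR and EIW-KLR are comparable on the walk vs. run task, while the gap between RIW-KLR and plain KLR is significant.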

Overall, the proposed relative density-ratio estimation method was shown to be useful
also in transfer learning under covariate shift.

5  Conclusion 

In this paper, we proposed to use a relative divergence for robust distribution comparison.
We gave a computationally efficient method for estimating the relative Pearson divergence
based on direct relative density-ratio approximation. We theoretically elucidated the
convergence rate of the proposed divergence estimator under a non-parametric setup, which
showed that the proposed approach of estimating the relative Pearson divergence is
preferable to the existing approach of estimating the plain Pearson divergence.
Furthermore, we proved that the asymptotic variance of the proposed divergence estimator is
independent of the model complexity under a correctly specified parametric setup. Thus,
the proposed divergence estimator hardly overfits even with complex models.
Experimentally, we demonstrated the practical usefulness of the proposed divergence estimator
in two-sample homogeneity testing, inlier-based outlier detection, and transductive transfer
learning under covariate shift.

In addition to two-sample homogeneity test, outlier detection, and transfer learning,
density ratios were shown to be useful for tackling various machine learning problems,
including multi-task learning (Bickel et al., 2008; Simm et al., 2011), independence test
(Sugiyama and Suzuki, 2011), feature selection (Suzuki et al., 2009), causal inference
(Yamada and Sugiyama, 2010), independent component analysis (Suzuki and Sugiyama,
2011), dimensionality reduction (Suzuki and Sugiyama, 2010), unpaired data matching
(Yamada and Sugiyama, 2011), clustering (Kimura and Sugiyama, 2011), conditional
density estimation (Sugiyama et al., 2010), and probabilistic classification (Sugiyama,
2010). Thus, it would be promising to explore more applications of the proposed relative
density-ratio approximator beyond two-sample homogeneity test, outlier detection, and
transfer learning tasks.

A  Technical Details of Non-Parametric Convergence Analysis

Here,  we  give  the  technical  details  of  the  non-parametric  convergence  analysis  described 
in  Section  3.1. 

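A.1  Results

For notational simplicity, we define linear operators $P$, $P_n$, $P'$, $P'_{n'}$ as

$$Pf := \mathbb{E}_p f, \qquad P_n f := \frac{1}{n}\sum_{i=1}^{n} f(x_i),$$

$$P'f := \mathbb{E}_q f, \qquad P'_{n'} f := \frac{1}{n'}\sum_{j=1}^{n'} f(x'_j).$$

For $\alpha \in [0, 1]$, we define $S_{n,n'}$ and $S$ as

$$S_{n,n'} = \alpha P_n + (1 - \alpha) P'_{n'}, \qquad S = \alpha P + (1 - \alpha) P'.$$

We estimate the Pearson divergence between $p$ and $\alpha p + (1 - \alpha) q$ (here $q$ and $p'$
denote the same density) through estimating the density ratio

$$g^* := \frac{p}{\alpha p + (1 - \alpha) p'}.$$

Let us consider the following density ratio estimator:

$$\hat{g} := \mathop{\mathrm{argmin}}_{g \in \mathcal{G}} \left[ \frac{1}{2}\left(\alpha P_n + (1 - \alpha) P'_{n'}\right) g^2 - P_n g + \lambda_{\bar{n}} R(g)^2 \right]
= \mathop{\mathrm{argmin}}_{g \in \mathcal{G}} \left( \frac{1}{2} S_{n,n'} g^2 - P_n g + \lambda_{\bar{n}} R(g)^2 \right),$$

where $\bar{n} = \min(n, n')$ and $R(g)$ is a non-negative regularization functional such that

$$\sup_x |g(x)| \le R(g). \tag{17}$$

A possible estimator of the Pearson (PE) divergence $\mathrm{PE}_\alpha$ is

$$\widehat{\mathrm{PE}}_\alpha := P_n \hat{g} - \frac{1}{2} S_{n,n'} \hat{g}^2 - \frac{1}{2}.$$

Another possibility is

$$\widetilde{\mathrm{PE}}_\alpha := \frac{1}{2} P_n \hat{g} - \frac{1}{2}.$$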
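For a finite kernel model the estimator above has a closed-form solution, which the following sketch illustrates in one dimension. It is not the analysis setting itself: it assumes a Gaussian kernel model $g(x) = \sum_l \theta_l k(x, c_l)$ and replaces $R(g)^2$ by the squared Euclidean norm of $\theta$, so the minimizer solves $(\hat{H} + 2\lambda I)\theta = \hat{h}$. The kernel width, centers, sample sizes, and all function names are hypothetical choices.

```python
import math
import random

def kernel(x, c, sigma):
    return math.exp(-(x - c) ** 2 / (2.0 * sigma ** 2))

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x

def fit(xs, xs_prime, centers, alpha, lam, sigma):
    """Minimize 0.5 * S_{n,n'} g^2 - P_n g + lam * ||theta||^2 (assumed penalty)."""
    phi = [[kernel(x, c, sigma) for c in centers] for x in xs]
    phi_prime = [[kernel(x, c, sigma) for c in centers] for x in xs_prime]
    b, n, n_prime = len(centers), len(xs), len(xs_prime)
    h = [sum(row[l] for row in phi) / n for l in range(b)]
    big_h = [[alpha * sum(r[l] * r[m] for r in phi) / n
              + (1.0 - alpha) * sum(r[l] * r[m] for r in phi_prime) / n_prime
              + (2.0 * lam if l == m else 0.0)
              for m in range(b)] for l in range(b)]
    return solve(big_h, h)

def divergence_estimates(xs, xs_prime, centers, theta, alpha, sigma):
    """Return the two plug-in estimates P_n g - S g^2 / 2 - 1/2 and P_n g / 2 - 1/2."""
    def g(x):
        return sum(t * kernel(x, c, sigma) for t, c in zip(theta, centers))
    pn_g = sum(g(x) for x in xs) / len(xs)
    s_g2 = (alpha * sum(g(x) ** 2 for x in xs) / len(xs)
            + (1.0 - alpha) * sum(g(x) ** 2 for x in xs_prime) / len(xs_prime))
    return pn_g - 0.5 * s_g2 - 0.5, 0.5 * pn_g - 0.5

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(100)]         # samples from p
xs_prime = [random.gauss(0.5, 1.0) for _ in range(100)]   # samples from p'
centers = xs[:10]
theta = fit(xs, xs_prime, centers, alpha=0.5, lam=0.1, sigma=1.0)
pe_hat, pe_tilde = divergence_estimates(xs, xs_prime, centers, theta, 0.5, 1.0)
```

With this penalty the two estimates differ by exactly $\lambda \|\theta\|^2$ at the minimizer, which follows from multiplying the normal equations by $\theta^\top$; the gap therefore vanishes as the regularization is relaxed.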

A useful example is to use a reproducing kernel Hilbert space (RKHS; Aronszajn, 1950)
as $\mathcal{G}$ and the RKHS norm as $R(g)$. Suppose $\mathcal{G}$ is an RKHS associated with a bounded kernel
$k(\cdot, \cdot)$:

$$\sup_x k(x, x) \le C.$$

Let $\|\cdot\|_{\mathcal{G}}$ denote the norm in the RKHS $\mathcal{G}$. Then $R(g) = \sqrt{C}\,\|g\|_{\mathcal{G}}$ satisfies Eq.(17):

$$g(x) = \langle k(x, \cdot), g(\cdot) \rangle \le \sqrt{k(x, x)}\,\|g\|_{\mathcal{G}} \le \sqrt{C}\,\|g\|_{\mathcal{G}},$$

where we used the reproducing property of the kernel and the Cauchy-Schwarz inequality. Note
that the Gaussian kernel satisfies this with $C = 1$. It is known that the Gaussian kernel
RKHS spans a dense subset in the set of continuous functions. Another example of
RKHSs is Sobolev space. The canonical norm for this space is the integral of the squared
derivatives of functions. Thus the regularization term $R(g) = \|g\|_{\mathcal{G}}$ forces the solution
to be smooth. The RKHS technique in Sobolev space has been well exploited in the
context of spline models (Wahba, 1990). We intend the regularization term $R(g)$
to be a generalization of the RKHS norm. Roughly speaking, $R(g)$ is like a “norm” of the
function space $\mathcal{G}$.

We assume that the true density-ratio function $g^*(x)$ is contained in the model $\mathcal{G}$ and
is bounded from above:

$$g^*(x) \le M_0 \quad \text{for all } x \in \mathcal{D}_X.$$

Let $\mathcal{G}_M$ be a ball of $\mathcal{G}$ with radius $M > 0$:

$$\mathcal{G}_M := \{ g \in \mathcal{G} \mid R(g) \le M \}.$$
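The RKHS bound behind Eq.(17) can be checked numerically for a function in the span of a few Gaussian kernels. The sketch below is only an illustration; the coefficients, centers, kernel width, and function names are arbitrary assumptions. For $g = \sum_i a_i k(\cdot, c_i)$ the squared RKHS norm is $a^\top K a$ with $K_{ij} = k(c_i, c_j)$, and the Gaussian kernel has $C = 1$.

```python
import math

def kernel(x, y, sigma=1.0):
    return math.exp(-(x - y) ** 2 / (2.0 * sigma ** 2))

def evaluate(coefs, centers, x):
    """g(x) = sum_i a_i k(x, c_i)."""
    return sum(a * kernel(x, c) for a, c in zip(coefs, centers))

def rkhs_norm(coefs, centers):
    """sqrt(a^T K a); clipped at zero to guard against rounding."""
    sq = sum(a * b * kernel(c, d)
             for a, c in zip(coefs, centers)
             for b, d in zip(coefs, centers))
    return math.sqrt(max(sq, 0.0))

coefs = [1.0, -2.0, 0.5, 3.0, -1.5]       # arbitrary illustrative coefficients
centers = [-2.0, -1.0, 0.0, 1.0, 2.0]     # arbitrary illustrative centers
sup_abs = max(abs(evaluate(coefs, centers, i * 0.01)) for i in range(-500, 501))
bound = rkhs_norm(coefs, centers)          # sqrt(C) * ||g|| with C = 1
```

On the grid, the largest absolute value of $g$ stays below the RKHS norm, as the reproducing property and the Cauchy-Schwarz inequality guarantee.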


To derive the convergence rate of our estimator, we utilize the bracketing entropy, which is
a complexity measure of a function class (see p. 83 of van der Vaart and Wellner, 1996).

Definition 1 Given two functions $l$ and $u$, the bracket $[l, u]$ is the set of all functions
$f$ with $l(x) \le f(x) \le u(x)$ for all $x$. An $\epsilon$-bracket with respect to $L_2(p)$ is a bracket
$[l, u]$ with $\|l - u\|_{L_2(p)} < \epsilon$. The bracketing entropy $H_{[]}(\mathcal{F}, \epsilon, L_2(p))$ is the logarithm of the
minimum number of $\epsilon$-brackets with respect to $L_2(p)$ needed to cover a function set $\mathcal{F}$.

We assume that there exists $\gamma$ ($0 < \gamma < 2$) such that, for all $M > 0$,

$$H_{[]}(\mathcal{G}_M, \epsilon, L_2(p)) = O\!\left(\left(\frac{M}{\epsilon}\right)^{\gamma}\right), \qquad H_{[]}(\mathcal{G}_M, \epsilon, L_2(p')) = O\!\left(\left(\frac{M}{\epsilon}\right)^{\gamma}\right). \tag{18}$$

This quantity represents the complexity of the function class $\mathcal{G}$: the larger $\gamma$ is, the more
complex the function class $\mathcal{G}$ is because, for larger $\gamma$, more brackets are needed to cover
the function class. The Gaussian RKHS satisfies this condition for arbitrarily small $\gamma$
(Steinwart and Scovel, 2007). Note that when $R(g)$ is the RKHS norm, the condition
(18) holds for all $M > 0$ if it holds for $M = 1$.
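To make Definition 1 concrete, the sketch below covers a deliberately simple class that is not taken from the paper: the constant functions $f_c(x) = c$ with $0 \le c \le M$. With constant lower and upper functions, the $L_2(p)$ size of a bracket equals its width for any probability density $p$, so brackets of width at most `eps` suffice; the definition uses a strict inequality, which a slightly smaller width satisfies. The function names are hypothetical.

```python
import math

def constant_class_brackets(big_m, eps):
    """Brackets (l, u) of width at most eps covering {f_c(x) = c : 0 <= c <= big_m}."""
    count = max(1, math.ceil(big_m / eps))
    return [(k * eps, min((k + 1) * eps, big_m)) for k in range(count)]

def bracketing_entropy(big_m, eps):
    """Logarithm of the number of brackets used for this toy class."""
    return math.log(len(constant_class_brackets(big_m, eps)))

brackets = constant_class_brackets(1.0, 0.25)
entropy_coarse = bracketing_entropy(1.0, 0.25)
entropy_fine = bracketing_entropy(1.0, 0.03125)
```

For this toy class the entropy grows only logarithmically in $M/\epsilon$, so a bound of the form (18) holds for any $\gamma > 0$; richer classes need polynomially many brackets, which is what larger $\gamma$ expresses.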

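Then we have the following theorem.

Theorem 1 Let $\bar{n} = \min(n, n')$, $M_0 = \|g^*\|_\infty$, and
$c = (1 + \alpha)\sqrt{P(g^* - Pg^*)^2} + (1 - \alpha)\sqrt{P'(g^* - P'g^*)^2}$.
Under the above setting, if $\lambda_{\bar{n}} \to 0$ and $\lambda_{\bar{n}}^{-1} = o(\bar{n}^{2/(2+\gamma)})$, then we
have

$$\widehat{\mathrm{PE}}_\alpha - \mathrm{PE}_\alpha = O_p\!\left(\lambda_{\bar{n}} \max(1, R(g^*)^2) + \bar{n}^{-1/2} c M_0\right),$$

and

$$\widetilde{\mathrm{PE}}_\alpha - \mathrm{PE}_\alpha = O_p\!\left(\lambda_{\bar{n}} \max\{1, M_0^{\frac{1}{2}(1-\frac{\gamma}{2})}, R(g^*) M_0^{\frac{1}{2}(1-\frac{\gamma}{2})}, R(g^*)\} + \lambda_{\bar{n}}^{1/2} \max\{M_0^{1/2}, M_0^{1/2} R(g^*)\}\right),$$

where $O_p$ denotes the asymptotic order in probability.

In the proof of Theorem 1, we use the following auxiliary lemma.

Lemma 1 Under the setting of Theorem 1, if $\lambda_{\bar{n}} \to 0$ and $\lambda_{\bar{n}}^{-1} = o(\bar{n}^{2/(2+\gamma)})$, then we
have

$$\|\hat{g} - g^*\|_{L_2(S)} = O_p\!\left(\lambda_{\bar{n}}^{1/2} \max\{1, R(g^*)\}\right), \qquad R(\hat{g}) = O_p\!\left(\max\{1, R(g^*)\}\right),$$

where $\|\cdot\|_{L_2(S)}$ denotes the $L_2(\alpha p + (1 - \alpha) q)$-norm.

A.2  Proof of Lemma 1

First, we prove Lemma 1.

From the definition, we obtain

$$\frac{1}{2} S_{n,n'} \hat{g}^2 - P_n \hat{g} + \lambda_{\bar{n}} R(\hat{g})^2 \le \frac{1}{2} S_{n,n'} g^{*2} - P_n g^* + \lambda_{\bar{n}} R(g^*)^2$$

$$\Rightarrow \quad \frac{1}{2} S_{n,n'} (\hat{g} - g^*)^2 - S_{n,n'}\big(g^*(g^* - \hat{g})\big) - P_n(\hat{g} - g^*) + \lambda_{\bar{n}}\big(R(\hat{g})^2 - R(g^*)^2\big) \le 0.$$

On the other hand, $S(g^*(g^* - \hat{g})) = P(g^* - \hat{g})$ indicates

$$\frac{1}{2}(S - S_{n,n'})(\hat{g} - g^*)^2 - (S - S_{n,n'})\big(g^*(g^* - \hat{g})\big) - (P - P_n)(\hat{g} - g^*) - \lambda_{\bar{n}}\big(R(\hat{g})^2 - R(g^*)^2\big) \ge \frac{1}{2} S(\hat{g} - g^*)^2.$$

Therefore, to bound $\|\hat{g} - g^*\|_{L_2(S)}$, it suffices to bound the left-hand side of the above
inequality.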

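Define $\mathcal{F}_M$ and $\mathcal{F}_M^2$ as

$$\mathcal{F}_M := \{ g - g^* \mid g \in \mathcal{G}_M \} \quad \text{and} \quad \mathcal{F}_M^2 := \{ f^2 \mid f \in \mathcal{F}_M \}.$$

To bound $|(S - S_{n,n'})(\hat{g} - g^*)^2|$, we need to bound the bracketing entropies of $\mathcal{F}_M^2$. We
show that

$$H_{[]}(\mathcal{F}_M^2, \delta, L_2(p)) = O\!\left(\left(\frac{(M + M_0)^2}{\delta}\right)^{\gamma}\right), \qquad H_{[]}(\mathcal{F}_M^2, \delta, L_2(q)) = O\!\left(\left(\frac{(M + M_0)^2}{\delta}\right)^{\gamma}\right).$$

This can be shown as follows. Let $f_L$ and $f_U$ be a $\delta$-bracket for $\mathcal{F}_M$ with respect to $L_2(p)$:
$f_L(x) \le f_U(x)$ and $\|f_L - f_U\|_{L_2(p)} \le \delta$. Without loss of generality, we can assume that
$\|f_L\|_\infty, \|f_U\|_\infty \le M + M_0$. Then $f'_U$ and $f'_L$ defined as

$$f'_U(x) := \max\{f_L^2(x), f_U^2(x)\},$$

$$f'_L(x) := \begin{cases} \min\{f_L^2(x), f_U^2(x)\} & (\mathrm{sign}(f_L(x)) = \mathrm{sign}(f_U(x))), \\ 0 & (\text{otherwise}) \end{cases}$$

are also a bracket such that $f'_L \le g^2 \le f'_U$ for all $g \in \mathcal{F}_M$ s.t. $f_L \le g \le f_U$, and
$\|f'_L - f'_U\|_{L_2(p)} \le 2\delta(M + M_0)$, because $\|f_L - f_U\|_{L_2(p)} \le \delta$ and the following relation is
met:

$$(f'_L(x) - f'_U(x))^2 \le \begin{cases} (f_L^2(x) - f_U^2(x))^2 & (\mathrm{sign}(f_L(x)) = \mathrm{sign}(f_U(x))), \\ \max\{f_L^4(x), f_U^4(x)\} & (\text{otherwise}) \end{cases}$$

$$\le \begin{cases} (f_L(x) - f_U(x))^2 (f_L(x) + f_U(x))^2 & (\mathrm{sign}(f_L(x)) = \mathrm{sign}(f_U(x))), \\ \max\{f_L^4(x), f_U^4(x)\} & (\text{otherwise}) \end{cases}$$

$$\le \begin{cases} (f_L(x) - f_U(x))^2 (f_L(x) + f_U(x))^2 & (\mathrm{sign}(f_L(x)) = \mathrm{sign}(f_U(x))), \\ (f_L(x) - f_U(x))^2 (|f_L(x)| + |f_U(x)|)^2 & (\text{otherwise}) \end{cases}$$

$$\le 4 (f_L(x) - f_U(x))^2 (M + M_0)^2.$$

Therefore the condition for the bracketing entropies (18) gives
$H_{[]}(\mathcal{F}_M^2, \delta, L_2(p)) = O\big(((M + M_0)^2/\delta)^{\gamma}\big)$. We can also show that
$H_{[]}(\mathcal{F}_M^2, \delta, L_2(q)) = O\big(((M + M_0)^2/\delta)^{\gamma}\big)$ in the
same fashion.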

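Let $f := \hat{g} - g^*$. Then, as in Lemma 5.14 and Theorem 10.6 in (van de Geer, 2000),
we obtain

$$|(S_{n,n'} - S)(f^2)| \le \alpha |(P_n - P)(f^2)| + (1 - \alpha) |(P'_{n'} - P')(f^2)|$$

$$= \alpha\, O_p\!\left( \frac{1}{\sqrt{\bar{n}}} \|f^2\|_{L_2(P)}^{1-\frac{\gamma}{2}} (1 + R(\hat{g})^2 + M_0^2)^{\frac{\gamma}{2}} \vee \bar{n}^{-\frac{2}{2+\gamma}} (1 + R(\hat{g})^2 + M_0^2) \right)$$

$$+ (1 - \alpha)\, O_p\!\left( \frac{1}{\sqrt{\bar{n}}} \|f^2\|_{L_2(P')}^{1-\frac{\gamma}{2}} (1 + R(\hat{g})^2 + M_0^2)^{\frac{\gamma}{2}} \vee \bar{n}^{-\frac{2}{2+\gamma}} (1 + R(\hat{g})^2 + M_0^2) \right)$$

$$\le O_p\!\left( \frac{1}{\sqrt{\bar{n}}} \|f^2\|_{L_2(S)}^{1-\frac{\gamma}{2}} (1 + R(\hat{g})^2 + M_0^2)^{\frac{\gamma}{2}} \vee \bar{n}^{-\frac{2}{2+\gamma}} (1 + R(\hat{g})^2 + M_0^2) \right), \tag{19}$$

where $a \vee b = \max(a, b)$ and we used

$$\alpha \|f^2\|_{L_2(P)}^{1-\frac{\gamma}{2}} + (1 - \alpha) \|f^2\|_{L_2(P')}^{1-\frac{\gamma}{2}} \le \left( \int f^4 \, \mathrm{d}(\alpha P + (1 - \alpha) P') \right)^{\frac{1}{2}(1-\frac{\gamma}{2})} = \|f^2\|_{L_2(S)}^{1-\frac{\gamma}{2}}$$

by Jensen’s inequality for a concave function. Since

$$\|f^2\|_{L_2(S)} \le \|f\|_{L_2(S)} \sqrt{2(1 + R(\hat{g})^2 + M_0^2)},$$

the right-hand side of Eq.(19) is further bounded by

$$|(S_{n,n'} - S)(f^2)| = O_p\!\left( \frac{1}{\sqrt{\bar{n}}} \|f\|_{L_2(S)}^{1-\frac{\gamma}{2}} (1 + R(\hat{g})^2 + M_0^2)^{\frac{1}{2}+\frac{\gamma}{4}} \vee \bar{n}^{-\frac{2}{2+\gamma}} (1 + R(\hat{g})^2 + M_0^2) \right). \tag{20}$$

Similarly, we can show that

$$|(S_{n,n'} - S)(g^*(g^* - \hat{g}))| = O_p\!\left( \frac{1}{\sqrt{\bar{n}}} \|f\|_{L_2(S)}^{1-\frac{\gamma}{2}} (1 + R(\hat{g}) M_0 + M_0^2)^{\frac{\gamma}{2}} \vee \bar{n}^{-\frac{2}{2+\gamma}} (1 + R(\hat{g}) M_0 + M_0^2) \right), \tag{21}$$

and

$$|(P_n - P)(g^* - \hat{g})| = O_p\!\left( \frac{1}{\sqrt{\bar{n}}} \|f\|_{L_2(P)}^{1-\frac{\gamma}{2}} (1 + R(\hat{g}) + M_0)^{\frac{\gamma}{2}} \vee \bar{n}^{-\frac{2}{2+\gamma}} (1 + R(\hat{g}) + M_0) \right)$$

$$\le O_p\!\left( \frac{1}{\sqrt{\bar{n}}} \|f\|_{L_2(S)}^{1-\frac{\gamma}{2}} (1 + R(\hat{g}) + M_0)^{\frac{\gamma}{2}} M_0^{\frac{1}{2}(1-\frac{\gamma}{2})} \vee \bar{n}^{-\frac{2}{2+\gamma}} (1 + R(\hat{g}) + M_0) \right), \tag{22}$$

where we used

$$\|f\|_{L_2(P)} = \sqrt{\int f^2 \, \mathrm{d}P} = \sqrt{\int f^2 g^* \, \mathrm{d}S} \le M_0^{1/2} \sqrt{\int f^2 \, \mathrm{d}S}$$

in the last inequality. Combining Eqs.(20), (21), and (22), we can bound the $L_2(S)$-norm
of $f$ as

$$\frac{1}{2} \|f\|_{L_2(S)}^2 + \lambda_{\bar{n}} R(\hat{g})^2 \le \lambda_{\bar{n}} R(g^*)^2 + O_p\!\left( \frac{1}{\sqrt{\bar{n}}} \|f\|_{L_2(S)}^{1-\frac{\gamma}{2}} (1 + R(\hat{g})^2 + M_0^2)^{\frac{1}{2}+\frac{\gamma}{4}} \vee \bar{n}^{-\frac{2}{2+\gamma}} (1 + R(\hat{g})^2 + M_0^2) \right). \tag{23}$$

The following is similar to the argument in Theorem 10.6 in (van de Geer, 2000), but
we give a simpler proof.

By Young’s inequality, we have $a^{\frac{1}{2}-\frac{\gamma}{4}} b^{\frac{1}{2}+\frac{\gamma}{4}} \le (\frac{1}{2} - \frac{\gamma}{4}) a + (\frac{1}{2} + \frac{\gamma}{4}) b \le a + b$ for all $a, b > 0$.
Applying this relation to Eq.(23), we obtain

$$\frac{1}{2} \|f\|_{L_2(S)}^2 + \lambda_{\bar{n}} R(\hat{g})^2$$

$$\le \lambda_{\bar{n}} R(g^*)^2 + O_p\!\left( \|f\|_{L_2(S)}^{2(\frac{1}{2}-\frac{\gamma}{4})} \left\{ \bar{n}^{-\frac{2}{2+\gamma}} (1 + R(\hat{g})^2 + M_0^2) \right\}^{\frac{1}{2}+\frac{\gamma}{4}} \vee \bar{n}^{-\frac{2}{2+\gamma}} (1 + R(\hat{g})^2 + M_0^2) \right)$$

$$\overset{\text{Young}}{\le} \lambda_{\bar{n}} R(g^*)^2 + \frac{1}{4} \|f\|_{L_2(S)}^2 + O_p\!\left( \bar{n}^{-\frac{2}{2+\gamma}} (1 + R(\hat{g})^2 + M_0^2) + \bar{n}^{-\frac{2}{2+\gamma}} (1 + R(\hat{g})^2 + M_0^2) \right)$$

$$= \lambda_{\bar{n}} R(g^*)^2 + \frac{1}{4} \|f\|_{L_2(S)}^2 + O_p\!\left( \bar{n}^{-\frac{2}{2+\gamma}} (1 + R(\hat{g})^2 + M_0^2) \right),$$

which indicates

$$\frac{1}{4} \|f\|_{L_2(S)}^2 + \lambda_{\bar{n}} R(\hat{g})^2 \le \lambda_{\bar{n}} R(g^*)^2 + o_p\!\left( \lambda_{\bar{n}} (1 + R(\hat{g})^2 + M_0^2) \right).$$

Therefore, by moving $o_p(\lambda_{\bar{n}} R(\hat{g})^2)$ to the left-hand side, we obtain

$$\frac{1}{4} \|f\|_{L_2(S)}^2 + \lambda_{\bar{n}} (1 - o_p(1)) R(\hat{g})^2 \le O_p\!\left( \lambda_{\bar{n}} (1 + R(g^*)^2 + M_0^2) \right) \le O_p\!\left( \lambda_{\bar{n}} (1 + R(g^*)^2) \right).$$

This gives

$$\|f\|_{L_2(S)} = O_p\!\left( \lambda_{\bar{n}}^{1/2} \max\{1, R(g^*)\} \right),$$

$$R(\hat{g}) = O_p\!\left( \sqrt{1 + R(g^*)^2} \right) = O_p\!\left( \max\{1, R(g^*)\} \right).$$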
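Consequently, the proof of Lemma 1 is complete.

A.3  Proof of Theorem 1

Based on Lemma 1, we prove Theorem 1.

As in the proof of Lemma 1, let $f := \hat{g} - g^*$. Since $(\alpha P + (1 - \alpha) P')(f g^*) = S(f g^*) = P f$,
we have

$$\mathrm{PE}_\alpha - \widehat{\mathrm{PE}}_\alpha = \frac{1}{2} S_{n,n'} \hat{g}^2 - P_n \hat{g} - \left( \frac{1}{2} S g^{*2} - P g^* \right)$$

$$= \frac{1}{2} S_{n,n'} (f + g^*)^2 - P_n (f + g^*) - \left( \frac{1}{2} S g^{*2} - P g^* \right)$$

$$= \frac{1}{2} S f^2 + \frac{1}{2} (S_{n,n'} - S) f^2 + (S_{n,n'} - S)(g^* f) - (P_n - P) f + \frac{1}{2} (S_{n,n'} - S) g^{*2} - (P_n g^* - P g^*). \tag{24}$$

Below, we bound each term of the right-hand side of the above equation.
By the central limit theorem, we have

$$\frac{1}{2} (S_{n,n'} - S) g^{*2} - (P_n g^* - P g^*) = O_p\!\left( \bar{n}^{-1/2} M_0 \left( (1 + \alpha) \sqrt{P(g^* - P g^*)^2} + (1 - \alpha) \sqrt{P'(g^* - P' g^*)^2} \right) \right).$$

Since Lemma 1 gives $\|f\|_{L_2(S)} = O_p(\lambda_{\bar{n}}^{1/2} \max(1, R(g^*)))$ and $R(\hat{g}) = O_p(\max(1, R(g^*)))$,
Eqs.(20), (21), and (22) in the proof of Lemma 1 imply

$$|(S_{n,n'} - S) f^2| = O_p\!\left( \frac{1}{\sqrt{\bar{n}}} \|f\|_{L_2(S)}^{1-\frac{\gamma}{2}} (1 + R(g^*))^{1+\frac{\gamma}{2}} \vee \bar{n}^{-\frac{2}{2+\gamma}} R(g^*)^2 \right) \le O_p\!\left( \lambda_{\bar{n}} \max(1, R(g^*)^2) \right),$$

$$|(S_{n,n'} - S)(g^* f)| = O_p\!\left( \frac{1}{\sqrt{\bar{n}}} \|f\|_{L_2(S)}^{1-\frac{\gamma}{2}} (1 + R(\hat{g}) M_0 + M_0^2)^{\frac{\gamma}{2}} \vee \bar{n}^{-\frac{2}{2+\gamma}} (1 + R(\hat{g}) M_0 + M_0^2) \right)$$

$$\le O_p\!\left( \lambda_{\bar{n}} \max(1, R(g^*) M_0^{\frac{\gamma}{2}}, M_0 R(g^*), M_0^2) \right) \le O_p\!\left( \lambda_{\bar{n}} \max(1, R(g^*) M_0^{\frac{\gamma}{2}}, M_0 R(g^*)) \right) \le O_p\!\left( \lambda_{\bar{n}} \max(1, R(g^*)^2) \right),$$

$$|(P_n - P) f| \le O_p\!\left( \frac{1}{\sqrt{\bar{n}}} \|f\|_{L_2(S)}^{1-\frac{\gamma}{2}} (1 + R(\hat{g}) + M_0)^{\frac{\gamma}{2}} M_0^{\frac{1}{2}(1-\frac{\gamma}{2})} \vee \bar{n}^{-\frac{2}{2+\gamma}} (1 + R(\hat{g}) + M_0) \right)$$

$$= O_p\!\left( \lambda_{\bar{n}} \max(1, M_0^{\frac{1}{2}(1-\frac{\gamma}{2})}, R(g^*) M_0^{\frac{1}{2}(1-\frac{\gamma}{2})}, R(g^*)) \right) \tag{25}$$

$$\le O_p\!\left( \lambda_{\bar{n}} \max(1, R(g^*)^2) \right),$$

where we used $\lambda_{\bar{n}}^{-1} = o(\bar{n}^{2/(2+\gamma)})$ and $M_0 \le R(g^*)$. Lemma 1 also implies

$$S f^2 = \|f\|_{L_2(S)}^2 = O_p\!\left( \lambda_{\bar{n}} \max(1, R(g^*)^2) \right).$$

Combining these inequalities with Eq.(24) implies

$$\widehat{\mathrm{PE}}_\alpha - \mathrm{PE}_\alpha = O_p\!\left( \lambda_{\bar{n}} \max(1, R(g^*)^2) + \bar{n}^{-1/2} c M_0 \right),$$

where we again used $M_0 \le R(g^*)$.

On the other hand, we have

$$\widetilde{\mathrm{PE}}_\alpha - \mathrm{PE}_\alpha = \frac{1}{2} P_n \hat{g} - \frac{1}{2} P g^* = \frac{1}{2} \left[ (P_n - P)(\hat{g} - g^*) + P(\hat{g} - g^*) + (P_n - P) g^* \right]. \tag{26}$$

Eq.(25) gives

$$(P_n - P)(\hat{g} - g^*) = O_p\!\left( \lambda_{\bar{n}} \max(1, M_0^{\frac{1}{2}(1-\frac{\gamma}{2})}, R(g^*) M_0^{\frac{1}{2}(1-\frac{\gamma}{2})}, R(g^*)) \right).$$

We also have

$$P(\hat{g} - g^*) \le \|\hat{g} - g^*\|_{L_2(P)} \le \|\hat{g} - g^*\|_{L_2(S)} M_0^{1/2} = O_p\!\left( \lambda_{\bar{n}}^{1/2} \max(M_0^{1/2}, M_0^{1/2} R(g^*)) \right),$$

and

$$(P_n - P) g^* = O_p\!\left( \bar{n}^{-1/2} \sqrt{P(g^* - P g^*)^2} \right) \le O_p\!\left( \bar{n}^{-1/2} M_0 \right) \le O_p\!\left( \lambda_{\bar{n}}^{1/2} \max(M_0^{1/2}, M_0^{1/2} R(g^*)) \right).$$

Therefore, by substituting these bounds into the relation (26), one observes that

$$\widetilde{\mathrm{PE}}_\alpha - \mathrm{PE}_\alpha = O_p\!\left( \lambda_{\bar{n}}^{1/2} \max(M_0^{1/2}, M_0^{1/2} R(g^*)) + \lambda_{\bar{n}} \max(1, M_0^{\frac{1}{2}(1-\frac{\gamma}{2})}, R(g^*) M_0^{\frac{1}{2}(1-\frac{\gamma}{2})}, R(g^*)) \right). \tag{27}$$

This completes the proof.  ■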


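B  Technical Details of Parametric Variance Analysis

Here, we give the technical details of the parametric variance analysis described in
Section 3.2.

B.1  Results

For the estimation of the $\alpha$-relative density-ratio (1), the statistical model

$$\mathcal{G} = \{ g(x; \theta) \mid \theta \in \Theta \subset \mathbb{R}^b \}$$

is used where $b$ is a finite number. Let us consider the following estimator of the $\alpha$-relative
density-ratio,

$$\hat{g} = \mathop{\mathrm{argmin}}_{g \in \mathcal{G}} \; \frac{1}{2} \left\{ \frac{\alpha}{n} \sum_{i=1}^{n} g(x_i)^2 + \frac{1 - \alpha}{n'} \sum_{j=1}^{n'} g(x'_j)^2 \right\} - \frac{1}{n} \sum_{i=1}^{n} g(x_i).$$

Suppose that the model is correctly specified, i.e., there exists $\theta^*$ such that

$$g(x; \theta^*) = r_\alpha(x).$$

Then, under a mild assumption (see Theorem 5.23 of van der Vaart, 2000), the estimator
$\hat{g}$ is consistent and the estimated parameter $\hat{\theta}$ satisfies the asymptotic normality in the
large sample limit. Then, a possible estimator of the $\alpha$-relative Pearson divergence $\mathrm{PE}_\alpha$
is

$$\widehat{\mathrm{PE}}_\alpha = \frac{1}{n} \sum_{i=1}^{n} \hat{g}(x_i) - \frac{1}{2} \left\{ \frac{\alpha}{n} \sum_{i=1}^{n} \hat{g}(x_i)^2 + \frac{1 - \alpha}{n'} \sum_{j=1}^{n'} \hat{g}(x'_j)^2 \right\} - \frac{1}{2}.$$

Note that there are other possible estimators for $\mathrm{PE}_\alpha$ such as

$$\widetilde{\mathrm{PE}}_\alpha = \frac{1}{2n} \sum_{i=1}^{n} \hat{g}(x_i) - \frac{1}{2}.$$

We study the asymptotic properties of $\widehat{\mathrm{PE}}_\alpha$. The expectation under the probability $p$
($p'$) is denoted as $\mathbb{E}_{p(x)}[\cdot]$ ($\mathbb{E}_{p'(x)}[\cdot]$). Likewise, the variance is denoted as $\mathbb{V}_{p(x)}[\cdot]$ ($\mathbb{V}_{p'(x)}[\cdot]$).
Then, we have the following theorem.

Theorem 2 Let $\|r\|_\infty$ be the sup-norm of the standard density ratio $r(x)$, and $\|r_\alpha\|_\infty$ be
the sup-norm of the $\alpha$-relative density ratio, i.e.,

$$\|r_\alpha\|_\infty = \frac{\|r\|_\infty}{\alpha \|r\|_\infty + 1 - \alpha}.$$

The variance of $\widehat{\mathrm{PE}}_\alpha$ is denoted as $\mathbb{V}[\widehat{\mathrm{PE}}_\alpha]$. Then, under the regularity condition for the
asymptotic normality, we have the following upper bound of $\mathbb{V}[\widehat{\mathrm{PE}}_\alpha]$:

$$\mathbb{V}[\widehat{\mathrm{PE}}_\alpha] = \frac{1}{n} \mathbb{V}_{p(x)}\!\left[ r_\alpha - \frac{\alpha r_\alpha^2}{2} \right] + \frac{1}{n'} \mathbb{V}_{p'(x)}\!\left[ \frac{(1 - \alpha) r_\alpha^2}{2} \right] + o\!\left( \frac{1}{n}, \frac{1}{n'} \right)$$

$$\le \frac{\|r_\alpha\|_\infty^2}{n} + \frac{\alpha^2 \|r_\alpha\|_\infty^4}{4n} + \frac{(1 - \alpha)^2 \|r_\alpha\|_\infty^4}{4n'} + o\!\left( \frac{1}{n}, \frac{1}{n'} \right).$$

Theorem 3 The variance of $\widetilde{\mathrm{PE}}_\alpha$ is denoted as $\mathbb{V}[\widetilde{\mathrm{PE}}_\alpha]$. Let $\nabla g$ be the gradient vector
of $g$ with respect to $\theta$ at $\theta = \theta^*$, i.e., $(\nabla g(x; \theta^*))_j = \partial g(x; \theta^*) / \partial \theta_j$. The matrix $U_\alpha$ is defined
by

$$U_\alpha = \alpha\, \mathbb{E}_{p(x)}[\nabla g \nabla g^\top] + (1 - \alpha)\, \mathbb{E}_{p'(x)}[\nabla g \nabla g^\top].$$

Then, under the regularity condition, the variance of $\widetilde{\mathrm{PE}}_\alpha$ is asymptotically given as

$$\mathbb{V}[\widetilde{\mathrm{PE}}_\alpha] = \frac{1}{n} \mathbb{V}_{p(x)}\!\left[ \frac{r_\alpha + (1 - \alpha r_\alpha)\, \mathbb{E}_{p(x)}[\nabla g]^\top U_\alpha^{-1} \nabla g}{2} \right] + \frac{1}{n'} \mathbb{V}_{p'(x)}\!\left[ \frac{(1 - \alpha) r_\alpha\, \mathbb{E}_{p(x)}[\nabla g]^\top U_\alpha^{-1} \nabla g}{2} \right] + o\!\left( \frac{1}{n}, \frac{1}{n'} \right).$$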
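The role of $\alpha$ in the bound of Theorem 2 can be seen by evaluating its leading terms for a large plain density ratio. The sketch below is an illustration only: it drops the $o(1/n, 1/n')$ remainder, and the function names and the numerical inputs are hypothetical.

```python
def relative_sup_norm(r_sup, alpha):
    """Sup-norm of the alpha-relative ratio given the sup-norm of the plain ratio."""
    return r_sup / (alpha * r_sup + 1.0 - alpha)

def variance_bound_leading_terms(r_sup, alpha, n, n_prime):
    """Leading terms of the Theorem 2 upper bound (remainder term omitted)."""
    ra = relative_sup_norm(r_sup, alpha)
    return (ra ** 2 / n
            + alpha ** 2 * ra ** 4 / (4.0 * n)
            + (1.0 - alpha) ** 2 * ra ** 4 / (4.0 * n_prime))

# A plain ratio with a very large sup-norm, compared at alpha = 0 and alpha = 0.5.
plain = variance_bound_leading_terms(1e6, 0.0, 100, 100)
relative = variance_bound_leading_terms(1e6, 0.5, 100, 100)
```

For $\alpha > 0$ the relative sup-norm stays below $1/\alpha$ however large $\|r\|_\infty$ is, so the leading terms remain moderate, while for $\alpha = 0$ they grow with the fourth power of $\|r\|_\infty$.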
B.2  Proof of Theorem 2


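Let $\hat{\theta}$ be the estimated parameter, i.e., $\hat{g}(x) = g(x; \hat{\theta})$. Suppose that $r_\alpha(x) = g(x; \theta^*) \in \mathcal{G}$
holds. Let $\delta\theta = \hat{\theta} - \theta^*$, then the asymptotic expansion of $\widehat{\mathrm{PE}}_\alpha$ is given as

$$\widehat{\mathrm{PE}}_\alpha = \frac{1}{n} \sum_{i=1}^{n} g(x_i; \hat{\theta}) - \frac{1}{2} \left\{ \frac{\alpha}{n} \sum_{i=1}^{n} g(x_i; \hat{\theta})^2 + \frac{1 - \alpha}{n'} \sum_{j=1}^{n'} g(x'_j; \hat{\theta})^2 \right\} - \frac{1}{2}$$

$$= \mathrm{PE}_\alpha + \frac{1}{n} \sum_{i=1}^{n} \big( r_\alpha(x_i) - \mathbb{E}_{p(x)}[r_\alpha] \big) + \frac{1}{n} \sum_{i=1}^{n} \nabla g(x_i; \theta^*)^\top \delta\theta$$

$$- \frac{1}{2} \left\{ \frac{\alpha}{n} \sum_{i=1}^{n} \big( r_\alpha(x_i)^2 - \mathbb{E}_{p(x)}[r_\alpha^2] \big) + \frac{1 - \alpha}{n'} \sum_{j=1}^{n'} \big( r_\alpha(x'_j)^2 - \mathbb{E}_{p'(x)}[r_\alpha^2] \big) \right\}$$

$$- \left\{ \frac{\alpha}{n} \sum_{i=1}^{n} r_\alpha(x_i) \nabla g(x_i; \theta^*) + \frac{1 - \alpha}{n'} \sum_{j=1}^{n'} r_\alpha(x'_j) \nabla g(x'_j; \theta^*) \right\}^\top \delta\theta + o_p\!\left( \frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n'}} \right).$$

Let us define the linear operator $G$ as

$$G f = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \big( f(x_i) - \mathbb{E}_{p(x)}[f] \big).$$

Likewise, the operator $G'$ is defined for the samples from $p'$. Then, we have

$$\widehat{\mathrm{PE}}_\alpha - \mathrm{PE}_\alpha = \frac{1}{\sqrt{n}} G\!\left( r_\alpha - \frac{\alpha}{2} r_\alpha^2 \right) - \frac{1}{\sqrt{n'}} G'\!\left( \frac{1 - \alpha}{2} r_\alpha^2 \right)$$

$$+ \left\{ \mathbb{E}_{p(x)}[\nabla g] - \alpha\, \mathbb{E}_{p(x)}[r_\alpha \nabla g] - (1 - \alpha)\, \mathbb{E}_{p'(x)}[r_\alpha \nabla g] \right\}^\top \delta\theta + o_p\!\left( \frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n'}} \right)$$

$$= \frac{1}{\sqrt{n}} G\!\left( r_\alpha - \frac{\alpha}{2} r_\alpha^2 \right) - \frac{1}{\sqrt{n'}} G'\!\left( \frac{1 - \alpha}{2} r_\alpha^2 \right) + o_p\!\left( \frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n'}} \right).$$

The second equality follows from

$$\mathbb{E}_{p(x)}[\nabla g] - \alpha\, \mathbb{E}_{p(x)}[r_\alpha \nabla g] - (1 - \alpha)\, \mathbb{E}_{p'(x)}[r_\alpha \nabla g] = 0.$$

Then, the asymptotic variance is given as

$$\mathbb{V}[\widehat{\mathrm{PE}}_\alpha] = \frac{1}{n} \mathbb{V}_{p(x)}\!\left[ r_\alpha - \frac{\alpha}{2} r_\alpha^2 \right] + \frac{1}{n'} \mathbb{V}_{p'(x)}\!\left[ \frac{1 - \alpha}{2} r_\alpha^2 \right] + o\!\left( \frac{1}{n}, \frac{1}{n'} \right). \tag{28}$$

We confirm that both $r_\alpha - \frac{\alpha}{2} r_\alpha^2$ and $\frac{1 - \alpha}{2} r_\alpha^2$ are non-negative and increasing functions with
respect to $r$ for any $\alpha \in [0, 1]$. Since the result is trivial for $\alpha = 1$, we suppose $0 \le \alpha < 1$.
The function $r_\alpha - \frac{\alpha}{2} r_\alpha^2$ is represented as

$$r_\alpha - \frac{\alpha}{2} r_\alpha^2 = \frac{r (\alpha r + 2 - 2\alpha)}{2 (\alpha r + 1 - \alpha)^2},$$

and thus, we have $r_\alpha - \frac{\alpha}{2} r_\alpha^2 = 0$ for $r = 0$. In addition, the derivative is equal to

$$\frac{\partial}{\partial r} \frac{r (\alpha r + 2 - 2\alpha)}{2 (\alpha r + 1 - \alpha)^2} = \frac{(1 - \alpha)^2}{(\alpha r + 1 - \alpha)^3},$$

which is positive for $r \ge 0$ and $\alpha \in [0, 1)$. Hence, the function $r_\alpha - \frac{\alpha}{2} r_\alpha^2$ is non-negative and
increasing with respect to $r$. Following the same line, we see that $\frac{1 - \alpha}{2} r_\alpha^2$ is non-negative
and increasing with respect to $r$. Thus, we have the following inequalities,

$$0 \le r_\alpha(x) - \frac{\alpha}{2} r_\alpha(x)^2 \le \|r_\alpha\|_\infty - \frac{\alpha}{2} \|r_\alpha\|_\infty^2,$$

$$0 \le \frac{1 - \alpha}{2} r_\alpha(x)^2 \le \frac{1 - \alpha}{2} \|r_\alpha\|_\infty^2.$$

As a result, upper bounds of the variances in Eq.(28) are given as

$$\mathbb{V}_{p(x)}\!\left[ r_\alpha - \frac{\alpha}{2} r_\alpha^2 \right] \le \left( \|r_\alpha\|_\infty - \frac{\alpha}{2} \|r_\alpha\|_\infty^2 \right)^2,$$

$$\mathbb{V}_{p'(x)}\!\left[ \frac{1 - \alpha}{2} r_\alpha^2 \right] \le \frac{(1 - \alpha)^2 \|r_\alpha\|_\infty^4}{4}.$$

Therefore, the following inequality holds,

$$\mathbb{V}[\widehat{\mathrm{PE}}_\alpha] \le \frac{1}{n} \left( \|r_\alpha\|_\infty - \frac{\alpha \|r_\alpha\|_\infty^2}{2} \right)^2 + \frac{1}{n'} \cdot \frac{(1 - \alpha)^2 \|r_\alpha\|_\infty^4}{4} + o\!\left( \frac{1}{n}, \frac{1}{n'} \right)$$

$$\le \frac{\|r_\alpha\|_\infty^2}{n} + \frac{\alpha^2 \|r_\alpha\|_\infty^4}{4n} + \frac{(1 - \alpha)^2 \|r_\alpha\|_\infty^4}{4n'} + o\!\left( \frac{1}{n}, \frac{1}{n'} \right),$$

which completes the proof.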
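The closed form and the derivative used in this proof can be sanity-checked numerically. The sketch below compares $r_\alpha - \frac{\alpha}{2} r_\alpha^2$ with the rational expression $r(\alpha r + 2 - 2\alpha) / (2(\alpha r + 1 - \alpha)^2)$, and a central finite difference with $(1 - \alpha)^2 / (\alpha r + 1 - \alpha)^3$. The function names, the step size, and the test points are illustrative assumptions.

```python
def h(r, alpha):
    """r_alpha - (alpha / 2) r_alpha^2 with r_alpha = r / (alpha r + 1 - alpha)."""
    r_alpha = r / (alpha * r + 1.0 - alpha)
    return r_alpha - 0.5 * alpha * r_alpha ** 2

def h_closed(r, alpha):
    """Rational closed form of h stated in the proof."""
    return r * (alpha * r + 2.0 - 2.0 * alpha) / (2.0 * (alpha * r + 1.0 - alpha) ** 2)

def dh_closed(r, alpha):
    """Derivative of h with respect to r stated in the proof."""
    return (1.0 - alpha) ** 2 / (alpha * r + 1.0 - alpha) ** 3

def dh_numeric(r, alpha, step=1e-6):
    """Central finite-difference approximation of the derivative of h."""
    return (h(r + step, alpha) - h(r - step, alpha)) / (2.0 * step)

# Largest discrepancies over a few (r, alpha) pairs.
points = [(r, a) for r in (0.5, 1.0, 5.0) for a in (0.0, 0.3, 0.9)]
max_value_gap = max(abs(h(r, a) - h_closed(r, a)) for r, a in points)
max_slope_gap = max(abs(dh_numeric(r, a) - dh_closed(r, a)) for r, a in points)
```

Both gaps are at the level of rounding error on these points, which agrees with the positive derivative and hence with the monotonicity claim for $0 \le \alpha < 1$.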


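B.3  Proof of Theorem 3

The estimator $\hat{\theta}$ is the optimal solution of the following problem:

$$\min_{\theta \in \Theta} \left[ \frac{1}{2n} \sum_{i=1}^{n} \alpha\, g(x_i; \theta)^2 + \frac{1}{2n'} \sum_{j=1}^{n'} (1 - \alpha)\, g(x'_j; \theta)^2 - \frac{1}{n} \sum_{i=1}^{n} g(x_i; \theta) \right].$$

Then, the extremal condition yields the equation,

$$\frac{\alpha}{n} \sum_{i=1}^{n} g(x_i; \hat{\theta}) \nabla g(x_i; \hat{\theta}) + \frac{1 - \alpha}{n'} \sum_{j=1}^{n'} g(x'_j; \hat{\theta}) \nabla g(x'_j; \hat{\theta}) - \frac{1}{n} \sum_{i=1}^{n} \nabla g(x_i; \hat{\theta}) = 0.$$

Let $\delta\theta = \hat{\theta} - \theta^*$. The asymptotic expansion of the above equation around $\theta = \theta^*$ leads
to

$$\frac{1}{n} \sum_{i=1}^{n} \big( \alpha r_\alpha(x_i) - 1 \big) \nabla g(x_i; \theta^*) + \frac{1 - \alpha}{n'} \sum_{j=1}^{n'} r_\alpha(x'_j) \nabla g(x'_j; \theta^*) + U_\alpha \delta\theta + o_p\!\left( \frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n'}} \right) = 0.$$

Therefore, we obtain

$$\delta\theta = \frac{1}{\sqrt{n}} G\big( (1 - \alpha r_\alpha) U_\alpha^{-1} \nabla g \big) - \frac{1}{\sqrt{n'}} G'\big( (1 - \alpha) r_\alpha U_\alpha^{-1} \nabla g \big) + o_p\!\left( \frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n'}} \right).$$

Next, we compute the asymptotic expansion of $\widetilde{\mathrm{PE}}_\alpha$:

$$\widetilde{\mathrm{PE}}_\alpha = \frac{1}{2} \mathbb{E}_{p(x)}[r_\alpha] + \frac{1}{2n} \sum_{i=1}^{n} \big( r_\alpha(x_i) - \mathbb{E}_{p(x)}[r_\alpha] \big) + \frac{1}{2n} \sum_{i=1}^{n} \nabla g(x_i; \theta^*)^\top \delta\theta - \frac{1}{2} + o_p\!\left( \frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n'}} \right)$$

$$= \mathrm{PE}_\alpha + \frac{1}{2\sqrt{n}} G(r_\alpha) + \frac{1}{2} \mathbb{E}_{p(x)}[\nabla g]^\top \delta\theta + o_p\!\left( \frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n'}} \right).$$

Substituting $\delta\theta$ into the above expansion, we have

$$\widetilde{\mathrm{PE}}_\alpha - \mathrm{PE}_\alpha = \frac{1}{2\sqrt{n}} G\big( r_\alpha + (1 - \alpha r_\alpha)\, \mathbb{E}_{p(x)}[\nabla g]^\top U_\alpha^{-1} \nabla g \big) - \frac{1}{2\sqrt{n'}} G'\big( (1 - \alpha) r_\alpha\, \mathbb{E}_{p(x)}[\nabla g]^\top U_\alpha^{-1} \nabla g \big) + o_p\!\left( \frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n'}} \right).$$

As a result, we have

$$\mathbb{V}[\widetilde{\mathrm{PE}}_\alpha] = \frac{1}{n} \mathbb{V}_{p(x)}\!\left[ \frac{r_\alpha + (1 - \alpha r_\alpha)\, \mathbb{E}_{p(x)}[\nabla g]^\top U_\alpha^{-1} \nabla g}{2} \right] + \frac{1}{n'} \mathbb{V}_{p'(x)}\!\left[ \frac{(1 - \alpha) r_\alpha\, \mathbb{E}_{p(x)}[\nabla g]^\top U_\alpha^{-1} \nabla g}{2} \right] + o\!\left( \frac{1}{n}, \frac{1}{n'} \right),$$

which completes the proof.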